Pub Date : 2025-12-09 | DOI: 10.1109/TKDE.2025.3622154
Yuanyuan Yao;Yuhan Shi;Lu Chen;Ziquan Fang;Yunjun Gao;Leong Hou U;Yushuai Li;Tianyi Li
Multivariate time series (MTS) anomaly detection identifies abnormal patterns where each timestamp contains multiple variables. Existing MTS anomaly detection methods fall into three categories: reconstruction-based, prediction-based, and classifier-based methods. However, these methods face three key challenges: (1) Unsupervised learning methods, such as reconstruction-based and prediction-based methods, rely on error thresholds, which can lead to inaccuracies; (2) Semi-supervised methods mainly model normal data and often underuse anomaly labels, limiting detection of subtle anomalies; (3) Supervised learning methods, such as classifier-based approaches, often fail to capture local relationships, incur high computational costs, and are constrained by the scarcity of labeled data. To address these limitations, we propose Moon, a supervised modality conversion-based multivariate time series anomaly detection framework. Moon enhances the efficiency and accuracy of anomaly detection while providing detailed anomaly analysis reports. First, Moon introduces a novel multivariate Markov Transition Field (MV-MTF) technique to convert numeric time series data into image representations, capturing relationships across variables and timestamps. Since numeric data retains unique patterns that cannot be fully captured by image conversion alone, Moon employs a Multimodal-CNN to integrate numeric and image data through a feature fusion model with parameter sharing, enhancing training efficiency. Finally, a SHAP-based anomaly explainer identifies key variables contributing to anomalies, improving interpretability. Extensive experiments on six real-world MTS datasets demonstrate that Moon outperforms six state-of-the-art methods by up to 93% in efficiency, 4% in accuracy, and 10.8% in interpretation performance.
{"title":"Moon: A Modality Conversion-Based Efficient Multivariate Time Series Anomaly Detection","authors":"Yuanyuan Yao;Yuhan Shi;Lu Chen;Ziquan Fang;Yunjun Gao;Leong Hou U;Yushuai Li;Tianyi Li","doi":"10.1109/TKDE.2025.3622154","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3622154","url":null,"abstract":"Multivariate time series (MTS) anomaly detection identifies abnormal patterns where each timestamp contains multiple variables. Existing MTS anomaly detection methods fall into three categories: reconstruction-based, prediction-based, and classifier-based methods. However, these methods face three key challenges: (1) Unsupervised learning methods, such as reconstruction-based and prediction-based methods, rely on error thresholds, which can lead to inaccuracies; (2) Semi-supervised methods mainly model normal dataand often underuse anomaly labels, limiting detection of subtle anomalies; (3) Supervised learning methods, such as classifier-based approaches, often fail to capture local relationships, incur high computational costs, and are constrained by the scarcity of labeled data. To address these limitations, we propose <sc>Moon</small>, a supervised modality conversion-based multivariate time series anomaly detection framework. <sc>Moon</small> enhances the efficiency and accuracy of anomaly detection while providing detailed anomaly analysis reports. First, <sc>Moon</small> introduces a novel multivariate Markov Transition Field (MV-MTF) technique to convert numeric time series data into image representations, capturing relationships across variables and timestamps. Since numeric data retains unique patterns that cannot be fully captured by image conversion alone, <sc>Moon</small> employs a Multimodal-CNN to integrate numeric and image data through a feature fusion model with parameter sharing, enhancing training efficiency. Finally, a SHAP-based anomaly explainer identifies key variables contributing to anomalies, improving interpretability. Extensive experiments on six real-world MTS datasets demonstrate that <sc>Moon</small> outperforms six state-of-the-art methods by up to 93% in efficiency, 4% in accuracy and, 10.8% in interpretation performance.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 1","pages":"457-474"},"PeriodicalIF":10.4,"publicationDate":"2025-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-03 | DOI: 10.1109/TKDE.2025.3639413
Tianyue Ren;Zhibang Yang;Yan Ding;Xu Zhou;Kenli Li;Yunjun Gao;Keqin Li
Spatial crowdsourcing (SC) has become increasingly popular in recent years. As a critical issue in SC, task assignment currently faces challenges due to the imbalanced spatiotemporal distribution of tasks. Hence, many studies and applications focusing on cross-platform task allocation in SC have emerged. Existing work primarily focuses on maximizing the total revenue of the inner platform in cross-platform task assignment. In this work, we formulate an SC problem called Cross Dynamic Task Assignment (CDTA) to maximize the overall utility and propose improved solutions aimed at creating a win-win situation for the inner platform, task requesters, and outer workers. We first design a hybrid batch processing framework and a novel cross-platform incentive mechanism. Then, to allocate tasks to both inner and outer workers, we present a KM-based algorithm that computes an accurate assignment in each batch and a highly efficient density-aware greedy algorithm. To maximize the revenue of the inner platform and outer workers simultaneously, we model the competition among outer workers as a potential game, which is shown to have at least one pure Nash equilibrium, and develop a game-theoretic method. Additionally, a simulated annealing-based improved algorithm is proposed to avoid falling into local optima. Finally, since random thresholds lead to unstable results when picking tasks that are preferentially assigned to inner workers, we devise an adaptive threshold selection algorithm based on multi-armed bandits to further improve the overall utility. Extensive experiments on both real and synthetic datasets demonstrate the effectiveness and efficiency of our proposed algorithms.
{"title":"Win-Win Approaches for Cross Dynamic Task Assignment in Spatial Crowdsourcing","authors":"Tianyue Ren;Zhibang Yang;Yan Ding;Xu Zhou;Kenli Li;Yunjun Gao;Keqin Li","doi":"10.1109/TKDE.2025.3639413","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3639413","url":null,"abstract":"Spatial crowdsourcing (SC) is becoming increasingly popular recently. As a critical issue in SC, task assignment currently faces challenges due to the imbalanced spatiotemporal distribution of tasks. Hence, many related studies and applications focusing on cross-platform task allocation in SC have emerged. Existing work primarily focuses on the maximization of total revenue for inner platform in cross task assignment. In this work, we formulate a SC problem called Cross Dynamic Task Assignment (CDTA) to maximize the overall utility and propose improved solutions aiming at creating a win-win situation for inner platform, task requesters, and outer workers. We first design a hybrid batch processing framework and a novel cross-platform incentive mechanism. Then, with the purpose of allocating tasks to both inner and outer workers, we present a KM-based algorithm that gets the accurate assignment result in each batch and a density-aware greedy algorithm with high efficiency. To maximize the revenue of inner platform and outer workers simultaneously, we model the competition among outer workers as a potential game that is shown to have at least one pure Nash equilibrium and develop a game-theoretic method. Additionally, a simulated annealing-based improved algorithm is proposed to avoid falling into local optima. Last but not least, since random thresholds lead to unstable results when picking tasks that are preferentially assigned to inner workers, we devise an adaptive threshold selection algorithm based on multi-armed bandit to further improve the overall utility. Extensive experiments demonstrate the effectiveness and efficiency of our proposed algorithms on both real and synthetic datasets.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1395-1411"},"PeriodicalIF":10.4,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-03 | DOI: 10.1109/TKDE.2025.3639418
Shidan Ma;Yan Ding;Xu Zhou;Peng Peng;Youhuan Li;Zhibang Yang;Kenli Li
Graph pattern queries (GPQ) over RDF graphs extend basic graph patterns to support variable-length paths (VLP), thereby enabling complex knowledge retrieval and navigation. Generally, variable-length paths describe the reachability between two vertices via a given property within a specified range. With the increasing scale of RDF graphs, it is necessary to design a partitioning method to achieve efficient distributed queries. Although many partitioning strategies have been proposed for large RDF graphs, most existing methods result in numerous inter-partition joins when processing GPQs, which impacts query performance. In this paper, we formulate a new partitioning problem, MaxLocJoin, which aims to minimize inter-partition joins during distributed GPQ processing. For MaxLocJoin, we propose a partitioning framework (PIP) based on property-induced subgraphs, which consist of edges with a specific set of properties. The framework first finds a locally joinable property set using a cost-driven algorithm, LJPS, where the cost depends on the sizes of weakly connected components within its property-induced subgraphs. Subsequently, the graph is partitioned according to the weakly connected components. The framework achieves two key objectives: first, it enables complete local processing of all variable-length path queries (eliminating inter-partition joins); second, it minimizes the number of inter-partition joins required for traditional graph pattern queries. Moreover, we identify two types of independently executable queries (IEQ): the locally joinable IEQ and the single-property IEQ. After that, a query decomposition algorithm is designed to transform every GPQ into one of them for independent execution in distributed environments. In experiments, we implement two prototype systems based on Jena and Virtuoso, and evaluate them over both real and synthetic RDF graphs. The results show that MaxLocJoin achieves performance improvements of 2.8x to 10.7x over existing methods.
{"title":"Property-Induced Partitioning for Graph Pattern Queries on Distributed RDF Systems","authors":"Shidan Ma;Yan Ding;Xu Zhou;Peng Peng;Youhuan Li;Zhibang Yang;Kenli Li","doi":"10.1109/TKDE.2025.3639418","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3639418","url":null,"abstract":"Graph pattern queries (GPQ) over RDF graphs extend basic graph patterns to support variable-length paths (VLP), thereby enabling complex knowledge retrieval and navigation. Generally, variable-length paths describe the reachability between two vertices via a given property within a specified range. With the increasing scale of RDF graphs, it is necessary to design a partitioning method to achieve efficient distributed queries. Although many partitioning strategies have been proposed for large RDF graphs, most existing methods result in numerous inter-partition joins when processing GPQs, which impacts query performance. In this paper, we formulate a new partitioning problem, MaxLocJoin, aims to minimize inter-partition joins during distributed GPQ processing. For MaxLocJoin, we propose a partitioning framework (PIP) based on property-induced subgraphs, which consist of edges with a specific set of properties. The framework first finds a locally joinable property set using a cost-driven algorithm, LJPS, where the cost depends on the sizes of weakly connected components within its property-induced subgraphs. Subsequently, the graph is partitioned according to the weakly connected components. The framework can achieve two key objectives: first, it enables complete local processing of all variable-length path queries (eliminating inter-partition joins); second, it can minimize the number of inter-partition joins required for traditional graph pattern queries. Moreover, we identify two types of independently executable queries (IEQ): the locally joinable IEQ and the single-property IEQ. After that, a query decomposition algorithm is designed to transform all GPQ into one of them for independent execution in distributed environments. In experiments, we implement two prototype systems based on Jena and Virtuoso, and evaluate them over both real and synthetic RDF graphs. The results show that MaxLocJoin achieves performance improvements from 2.8x to 10.7x over existing methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1249-1263"},"PeriodicalIF":10.4,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-01 | DOI: 10.1109/TKDE.2025.3639070
Pengfei Zhang;Zhikun Zhang;Yang Cao;Xiang Cheng;Youwen Zhu;Zhiquan Liu;Ji Zhang
Truth discovery has emerged as an effective tool to mitigate data inconsistency in crowdsensing by prioritizing data from high-quality responders. While local differential privacy (LDP) has emerged as a crucial privacy-preserving paradigm, existing studies under LDP rarely explore a worker's participation in specific tasks for sparse scenarios, which may also reveal sensitive information such as individual preferences and behaviors. Existing LDP mechanisms, when applied to truth discovery in sparse settings, may create undesirable dense distributions, provide insufficient privacy protection, and introduce excessive noise, compromising the efficacy of subsequent non-private truth discovery. Additionally, the interplay between noise injection and truth discovery remains insufficiently explored in the current literature. To address these issues, we propose a lOcally differentially private truth diSCovery approach for spArse cRowdsensing, namely OSCAR. The main idea is to use advanced optimization techniques to reconstruct the sparse data distribution and re-formalize truth discovery by considering the statistical characteristics of injected Laplacian noise while protecting the privacy of both the tasks being completed and the corresponding sensory data. Specifically, to address the data density concerns while alleviating noise, we design a randomized response-based Bernoulli matrix factorization method, BerRR. To recover the sparse structures from densified, perturbed data, we formalize a 0-1 integer programming problem and develop a sparse recovery solving method, SpaIE, based on implicit enumeration. We further devise a Laplacian-sensitive truth discovery method, LapCRH, that leverages maximum likelihood estimation to re-formalize truth discovery by measuring differences between noisy values and truths based on the statistical characteristics of Laplacian noise. Our comprehensive theoretical analysis establishes OSCAR's privacy guarantees, utility bounds, and computational complexity. Experimental results show that OSCAR surpasses state-of-the-art methods by at least 30% in accuracy.
{"title":"Locally Differentially Private Truth Discovery for Sparse Crowdsensing","authors":"Pengfei Zhang;Zhikun Zhang;Yang Cao;Xiang Cheng;Youwen Zhu;Zhiquan Liu;Ji Zhang","doi":"10.1109/TKDE.2025.3639070","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3639070","url":null,"abstract":"Truth discovery has emerged as an effective tool to mitigate data inconsistency in crowdsensing by prioritizing data from high-quality responders. While local differential privacy (LDP) has emerged as a crucial privacy-preserving paradigm, existing studies under LDP rarely explore a worker’s participation in specific tasks for sparse scenarios, which may also reveal sensitive information such as individual preferences and behaviors. Existing LDP mechanisms, when applied to truth discovery in sparse settings, may create undesirable dense distributions, provide insufficient privacy protection, and introduce excessive noise, compromising the efficacy of subsequent non-private truth discovery. Additionally, the interplay between noise injection and truth discovery remains insufficiently explored in the current literature. To address these issues, we propose a l<italic>O</i>cally differentially private truth di<italic>SC</i>overy approach for sp<italic>A</i>rse c<bold>R</b>owdsensing, namely <italic>OSCAR</i>. The main idea is to use advanced optimization techniques to reconstruct the sparse data distribution and re-formalize truth discovery by considering the statistical characteristics of injected Laplacian noise while protecting the privacy of both the tasks being completed and the corresponding sensory data. Specifically, to address the data density concerns while alleviating noise, we design a randomized response-based Bernoulli matrix factorization method <italic>BerRR</i>. To recover the sparse structures from densified, perturbed data, we formalize a 0-1 integer programming problem and develop a sparse recovery solving method <italic>SpaIE</i> based on implicit enumeration. We further devise a Laplacian-sensitive truth discovery method <italic>LapCRH</i> that leverages maximum likelihood estimation to re-formalize truth discovery by measuring differences between noisy values and truths based on the statistical characteristic of Laplacian noise. Our comprehensive theoretical analysis establishes <italic>OSCAR</i>’s privacy guarantees, utility bounds, and computational complexity. Experimental results show that <italic>OSCAR</i> surpasses the state-of-the-arts by at least 30% in accuracy improvement.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1189-1205"},"PeriodicalIF":10.4,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-01 | DOI: 10.1109/TKDE.2025.3638888
Hao Miao;Ziqiao Liu;Yan Zhao;Chenxi Liu;Chenjuan Guo;Bin Yang;Kai Zheng;Huan Li;Christian S. Jensen
With the proliferation of GPS-equipped edge devices, vast amounts of trajectory data are generated and accumulated in various domains, driving numerous urban applications. However, due to the limited data acquisition capabilities of edge devices, many trajectories are recorded at low sampling rates, reducing the effectiveness of these applications. To address this issue, we aim to recover high-sample-rate trajectories from low-sample-rate ones, enhancing the usability of trajectory data. Recent approaches to trajectory recovery often assume centralized data storage, which can lead to catastrophic forgetting, where previously learned knowledge is entirely forgotten when new data arrives. This not only poses privacy risks but also degrades performance in decentralized settings where data streams into the system incrementally. To enable decentralized training and streaming trajectory recovery, we propose a Lightweight incremental framework for federated Trajectory Recovery, called LightTR+, which is based on a client-server architecture. Given the limited processing capabilities of edge devices, LightTR+ includes a lightweight local trajectory embedding module that enhances computational efficiency without compromising feature extraction capabilities. To mitigate catastrophic forgetting, we propose an intra-domain knowledge distillation module. Additionally, LightTR+ features a meta-knowledge enhanced local-global training scheme, which reduces communication costs between the server and clients, further improving efficiency. Extensive experiments offer insight into the effectiveness and efficiency of LightTR+.
{"title":"LightTR+: A Lightweight Incremental Framework for Federated Trajectory Recovery","authors":"Hao Miao;Ziqiao Liu;Yan Zhao;Chenxi Liu;Chenjuan Guo;Bin Yang;Kai Zheng;Huan Li;Christian S. Jensen","doi":"10.1109/TKDE.2025.3638888","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3638888","url":null,"abstract":"With the proliferation of GPS-equipped edge devices, huge trajectory data are generated and accumulated in various domains, driving numerous urban applications. However, due to the limited data acquisition capabilities of edge devices, many trajectories are often recorded at low sampling rates, reducing the effectiveness of these applications. To address this issue, we aim to recover high-sample-rate trajectories from low-sample-rate ones enhancing the usability of trajectory data. Recent approaches to trajectory recovery often assume centralized data storage, which can lead to catastrophic forgetting, where previously learned knowledge is entirely forgotten when new data arrives. This not only poses privacy risks but also degrades performance in decentralized settings where data streams into the system incrementally. To enable decentralized training and streaming trajectory recovery, we propose a <underline>Light</u>weight incremental framework for federated <underline>T</u>rajectory <underline>R</u>ecovery, called LightTR+, which is based on a client-server architecture. Given the limited processing capabilities of edge devices, LightTR+ includes a lightweight local trajectory embedding module that enhances computational efficiency without compromising feature extraction capabilities. To mitigate catastrophic forgetting, we propose an intra-domain knowledge distillation module. Additionally, LightTR+ features a meta-knowledge enhanced local-global training scheme, which reduces communication costs between the server and clients, further improving efficiency. Extensive experiments offer insight into the effectiveness and efficiency of LightTR+.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1174-1188"},"PeriodicalIF":10.4,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-01 | DOI: 10.1109/TKDE.2025.3639074
Xiang Wu;Rong-Hua Li;Zhaoxin Fan;Kai Chen;Yujin Gao;Hongchao Qin;Guoren Wang
Temporal interactions form the crux of numerous real-world scenarios, thus necessitating effective modeling in temporal graph representation learning. Despite extensive research within this domain, we identify a significant oversight in current methodologies: the temporal-spatial dynamics in graphs, encompassing both structural and temporal coherence, remain largely unaddressed. In an effort to bridge this research gap, we present a novel framework termed Graph Representation learning enhanced by Periodic and Community Interactions (GRPCI). GRPCI consists of two primary mechanisms devised explicitly to tackle the aforementioned challenge. Firstly, to utilize latent temporal dynamics, we propose a novel periodicity-based neighborhood aggregation mechanism that underscores neighbors engaged in a periodic interaction pattern. This mechanism seamlessly integrates the element of periodicity into the model. Secondly, to exploit structural dynamics, we design a novel contrastive-based local community representation learning mechanism. This mechanism features a heuristic dynamic contrastive pair sampling strategy aimed at enhancing the modeling of the latent distribution of local communities within the graphs. Through the incorporation of these two mechanisms, GRPCI markedly augments the performance of graph networks. Empirical evaluations, conducted via a temporal link prediction task across five real-life datasets, attest to the superior performance of GRPCI in comparison to existing state-of-the-art methodologies. The results of this study validate the efficacy of GRPCI, thereby establishing a new benchmark for future research in the field of temporal graph representation learning. Our findings underscore the importance of considering both temporal and structural consistency in temporal graph learning, and advocate for further exploration of this paradigm.
{"title":"GRPCI: Harnessing Temporal-Spatial Dynamics for Graph Representation Learning","authors":"Xiang Wu;Rong-Hua Li;Zhaoxin Fan;Kai Chen;Yujin Gao;Hongchao Qin;Guoren Wang","doi":"10.1109/TKDE.2025.3639074","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3639074","url":null,"abstract":"Temporal interactions form the crux of numerous real-world scenarios, thus necessitating effective modeling in temporal graph representation learning. Despite extensive research within this domain, we identify a significant oversight in current methodologies: the temporal-spatial dynamics in graphs, encompassing both structural and temporal coherence, remain largely unaddressed. In an effort to bridge this research gap, we present a novel framework termed Graph Representation learning enhanced by Periodic and Community Interactions (GRPCI). GRPCI consists of two primary mechanisms devised explicitly to tackle the aforementioned challenge. Firstly, to utilize latent temporal dynamics, we propose a novel periodicity-based neighborhood aggregation mechanism that underscores neighbors engaged in a periodic interaction pattern. This mechanism seamlessly integrates the element of periodicity into the model. Secondly, to exploit structural dynamics, we design a novel contrastive-based local community representation learning mechanism. This mechanism features a heuristic dynamic contrastive pair sampling strategy aimed at enhancing the modeling of the latent distribution of local communities within the graphs. Through the incorporation of these two mechanisms, GRPCI markedly augments the performance of graph networks. Empirical evaluations, conducted via a temporal link prediction task across five real-life datasets, attest to the superior performance of GRPCI in comparison to existing state-of-the-art methodologies. The results of this study validate the efficacy of GRPCI, thereby establishing a new benchmark for future research in the field of temporal graph representation learning. Our findings underscore the importance of considering both temporal and structural consistency in temporal graph learning, and advocate for further exploration of this paradigm.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1144-1158"},"PeriodicalIF":10.4,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-01 | DOI: 10.1109/TKDE.2025.3638864
Yunxiao Zhao;Zhiqiang Wang;Xingtong Yu;Xiaoli Li;Jiye Liang;Ru Li
Rationalization, a data-centric framework, aims to build self-explanatory models that explain a prediction by generating a subset of human-intelligible pieces of the input data. It involves a cooperative game in which a generator extracts the most human-intelligible parts of the input (i.e., rationales), followed by a predictor that makes predictions based on these generated rationales. Conventional rationalization methods typically impose constraints via regularization terms to calibrate or penalize undesired generation. However, these methods suffer from a problem called mode collapse, in which the predictor produces correct predictions yet the generator consistently outputs rationales with collapsed patterns. Moreover, existing studies are typically designed separately for specific collapsed patterns, lacking a unified consideration. In this paper, we systematically revisit cooperative rationalization from a novel game-theoretic perspective and identify the fundamental cause of this problem: the generator no longer tends to explore new strategies to uncover informative rationales, ultimately leading the system to converge to a suboptimal game equilibrium (correct predictions versus collapsed rationales). To solve this problem, we propose a novel approach, Game-theoretic Policy Optimization oriented RATionalization (PoRat), which progressively introduces policy interventions into the cooperative game to move the system away from the suboptimal equilibrium, thereby guiding the model toward a better solution state. We theoretically analyze the cause of such a suboptimal equilibrium and prove the feasibility of the proposed method. Furthermore, we validate our method on nine widely used real-world datasets and two synthetic settings, where PoRat achieves up to 8.1% performance improvements over existing state-of-the-art methods.
{"title":"Learnable Game-Theoretic Policy Optimization for Data-Centric Self-Explanation Rationalization","authors":"Yunxiao Zhao;Zhiqiang Wang;Xingtong Yu;Xiaoli Li;Jiye Liang;Ru Li","doi":"10.1109/TKDE.2025.3638864","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3638864","url":null,"abstract":"Rationalization, a data-centric framework, aims to build self-explanatory models to explain the prediction outcome by generating a subset of human-intelligible pieces of the input data. It involves a cooperative game model where a generator generates the most human-intelligible parts of the input (i.e., rationales), followed by a predictor that makes predictions based on these generated rationales. Conventional rationalization methods typically impose constraints via regularization terms to calibrate or penalize undesired generation. However, these methods are suffering from a problem called mode collapse, in which the predictor produces correct predictions yet the generator consistently outputs rationales with collapsed patterns. Moreover, existing studies are typically designed separately for specific collapsed patterns, lacking a unified consideration. In this paper, we systematically revisit cooperative rationalization from a novel game-theoretic perspective and identify the fundamental cause of this problem: the generator no longer tends to explore new strategies to uncover informative rationales, ultimately leading the system to converge to a suboptimal game equilibrium (correct predictions <italic>versus</i> collapsed rationales). To solve this problem, we then propose a novel approach, Game-theoretic <bold>P</b>olicy <bold>O</b>ptimization oriented <bold>RAT</b>ionalization (<sc>PoRat</small>), which progressively introduces policy interventions to address the game equilibrium in the cooperative game process, thereby guiding the model toward a more optimal solution state. We theoretically analyse the cause of such a suboptimal equilibrium and prove the feasibility of the proposed method. Furthermore, we validate our method on nine widely used real-world datasets and two synthetic settings, where <sc>PoRat</small> achieves up to 8.1% performance improvements over existing state-of-the-art methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1159-1173"},"PeriodicalIF":10.4,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-28 | DOI: 10.1109/TKDE.2025.3638465
Fan Li;Xiaoyang Wang;Dawei Cheng;Wenjie Zhang;Chen Chen;Ying Zhang;Xuemin Lin
With growing demands for data privacy and model robustness, graph unlearning (GU), which erases the influence of specific data on trained GNN models, has gained significant attention. However, existing exact unlearning methods suffer from either low efficiency or poor model performance. While more utility-preserving and efficient, current approximate methods require access to the forget set during unlearning, which makes them inapplicable in immediate deletion scenarios, thereby undermining privacy. Additionally, these approximate methods, which attempt to directly perturb model parameters, still raise significant concerns regarding unlearning power in empirical studies. To fill the gap, we propose Transferable Condensation Graph Unlearning (TCGU), a data-centric solution to graph unlearning. Specifically, we first develop a two-level alignment strategy to pre-condense the original graph into a compact yet utility-preserving dataset for subsequent unlearning tasks. Upon receiving an unlearning request, we fine-tune the pre-condensed data with a low-rank plugin, to directly align its distribution with the remaining graph, thus efficiently revoking the information of deleted data without accessing them. A novel similarity distribution matching approach and a discrimination regularizer are proposed to effectively transfer condensed data and preserve its utility in GNN training, respectively. Finally, we retrain the GNN on the transferred condensed data. Extensive experiments on 7 benchmark datasets demonstrate that TCGU can achieve superior performance in terms of model utility, unlearning efficiency, and unlearning efficacy compared to existing GU methods. To the best of our knowledge, this is the first study to explore graph unlearning with immediate data removal using a data-centric approximate method.
{"title":"TCGU: Data-Centric Graph Unlearning Based on Transferable Condensation","authors":"Fan Li;Xiaoyang Wang;Dawei Cheng;Wenjie Zhang;Chen Chen;Ying Zhang;Xuemin Lin","doi":"10.1109/TKDE.2025.3638465","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3638465","url":null,"abstract":"With growing demands for data privacy and model robustness, graph unlearning (GU), which erases the influence of specific data on trained GNN models, has gained significant attention. However, existing exact unlearning methods suffer from either low efficiency or poor model performance. While more utility-preserving and efficient, current approximate methods require access to the forget set during unlearning, which makes them inapplicable in immediate deletion scenarios, thereby undermining privacy. Additionally, these approximate methods, which attempt to directly perturb model parameters, still raise significant concerns regarding unlearning power in empirical studies. To fill the gap, we propose Transferable Condensation Graph Unlearning (TCGU), a data-centric solution to graph unlearning. Specifically, we first develop a two-level alignment strategy to pre-condense the original graph into a compact yet utility-preserving dataset for subsequent unlearning tasks. Upon receiving an unlearning request, we fine-tune the pre-condensed data with a low-rank plugin, to directly align its distribution with the remaining graph, thus efficiently revoking the information of deleted data without accessing them. A novel similarity distribution matching approach and a discrimination regularizer are proposed to effectively transfer condensed data and preserve its utility in GNN training, respectively. Finally, we retrain the GNN on the transferred condensed data. Extensive experiments on 7 benchmark datasets demonstrate that TCGU can achieve superior performance in terms of model utility, unlearning efficiency, and unlearning efficacy compared to existing GU methods. To the best of our knowledge, this is the first study to explore graph unlearning with immediate data removal using a data-centric approximate method.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1334-1348"},"PeriodicalIF":10.4,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-28 | DOI: 10.1109/TKDE.2025.3638343
Miaomiao Cai;Lei Chen;Yifan Wang;Zhiyong Cheng;Min Zhang;Meng Wang
Popularity bias is a common challenge in recommender systems. It often causes unbalanced item recommendation performance and intensifies the Matthew effect. Due to limited user-item interactions, unpopular items are frequently constrained to the embedding neighborhoods of only a few users, leading to representation collapse and weakening the model's generalization. Although existing supervised alignment and reweighting methods can help mitigate this problem, they still face two major limitations: (1) they overlook the inherent variability among different Graph Convolutional Network (GCN) layers, which can result in negative gains in deeper layers; (2) they rely heavily on fixed hyperparameters to balance popular and unpopular items, limiting adaptability to diverse data distributions and increasing model complexity. To address these challenges, we propose the Graph-Structured Dual Adaptation Framework (GSDA), a dual adaptive framework for mitigating popularity bias in recommendation. Our theoretical analysis shows that supervised alignment in GCNs is hindered by the over-smoothing effect, where the distinction between popular and unpopular items diminishes as layers deepen, reducing the effectiveness of alignment at deeper levels. To overcome this limitation, GSDA integrates a hierarchical adaptive alignment mechanism that counteracts entropy decay across layers together with a distribution-aware contrastive weighting strategy based on the Gini coefficient, enabling the model to adapt its debiasing strength dynamically without relying on fixed hyperparameters. Extensive experiments on three benchmark datasets demonstrate that GSDA effectively alleviates popularity bias while consistently outperforming state-of-the-art methods in recommendation performance.
{"title":"Graph-Structured Driven Dual Adaptation for Mitigating Popularity Bias","authors":"Miaomiao Cai;Lei Chen;Yifan Wang;Zhiyong Cheng;Min Zhang;Meng Wang","doi":"10.1109/TKDE.2025.3638343","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3638343","url":null,"abstract":"Popularity bias is a common challenge in recommender systems. It often causes unbalanced item recommendation performance and intensifies the Matthew effect. Due to limited user-item interactions, unpopular items are frequently constrained to the embedding neighborhoods of only a few users, leading to representation collapse and weakening the model’s generalization. Although existing supervised alignment and reweighting methods can help mitigate this problem, they still face two major limitations: (1) they overlook the inherent variability among different Graph Convolutional Networks (GCNs) layers, which can result in negative gains in deeper layers; (2) they rely heavily on fixed hyperparameters to balance popular and unpopular items, limiting adaptability to diverse data distributions and increasing model complexity. To address these challenges, we propose <italic><b><u>G</u>raph-<u>S</u>tructured <u>D</u>ual <u>A</u>daptation Framework (GSDA)</b></i>, a dual adaptive framework for mitigating popularity bias in recommendation. Our theoretical analysis shows that supervised alignment in GCNs is hindered by the over-smoothing effect, where the distinction between popular and unpopular items diminishes as layers deepen, reducing the effectiveness of alignment at deeper levels. To overcome this limitation, <italic>GSDA</i> integrates a hierarchical adaptive alignment mechanism that counteracts entropy decay across layers together with a distribution-aware contrastive weighting strategy based on the Gini coefficient, enabling the model to adapt its debiasing strength dynamically without relying on fixed hyperparameters. Extensive experiments on three benchmark datasets demonstrate that <italic>GSDA</i> effectively alleviates popularity bias while consistently outperforming state-of-the-art methods in recommendation performance.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1129-1143"},"PeriodicalIF":10.4,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-28 | DOI: 10.1109/TKDE.2025.3638633
Zhi-Long Han;Ting-Zhu Huang;Xi-Le Zhao;Ben-Zheng Li;Meng Ding
Spatio-temporal traffic data imputation is a fundamental component in intelligent transportation systems, which can significantly improve data quality and enhance the accuracy of downstream data mining tasks. Recently, low-rank tensor representation has shown great potential for spatio-temporal traffic data imputation. However, the low-rank assumption focuses on the global structure, neglecting the critical spatial topology and local temporal dependencies inherent in spatio-temporal data. To address these issues, we propose a topology-induced low-rank tensor representation (TILR), which can accurately capture the underlying low-rankness of the spatial multi-scale features induced by topology knowledge. Moreover, to exploit local temporal dependencies, we suggest a learnable convolutional regularization framework, which not only includes some classical convolution-based regularizers but also leads to the discovery of new convolutional regularizers. Equipped with the suggested TILR and convolutional regularizer, we build a unified low-rank tensor model harmonizing spatial topology and temporal dependencies for traffic data imputation, which is expected to deliver promising performance even under extreme and complex missing scenarios. To solve the proposed nonconvex model, we develop an efficient alternating direction method of multipliers (ADMM)-based algorithm and analyze its computational complexity. Extensive experiments demonstrate that the proposed model outperforms state-of-the-art baselines for various missing scenarios. These results reveal the critical synergy between topology-aware low-rank constraints and temporal dynamic modeling for spatio-temporal data imputation.
{"title":"Topology-Induced Low-Rank Tensor Representation for Spatio-Temporal Traffic Data Imputation","authors":"Zhi-Long Han;Ting-Zhu Huang;Xi-Le Zhao;Ben-Zheng Li;Meng Ding","doi":"10.1109/TKDE.2025.3638633","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3638633","url":null,"abstract":"Spatio-temporal traffic data imputation is a fundamental component in intelligent transportation systems, which can significantly improve data quality and enhance the accuracy of downstream data mining tasks. Recently, low-rank tensor representation has shown great potential for spatio-temporal traffic data imputation. However, the low-rank assumption focuses on the global structure, neglecting the critical spatial topology and local temporal dependencies inherent in spatio-temporal data. To address these issues, we propose a topology-induced low-rank tensor representation (TILR), which can accurately capture the underlying low-rankness of the spatial multi-scale features induced by topology knowledge. Moreover, to exploit local temporal dependencies, we suggest a learnable convolutional regularization framework, which not only includes some classical convolution-based regularizers but also leads to the discovery of new convolutional regularizers. Equipped with the suggested TILR and convolutional regularizer, we build a unified low-rank tensor model harmonizing spatial topology and temporal dependencies for traffic data imputation, which is expected to deliver promising performance even under extreme and complex missing scenarios. To solve the proposed nonconvex model, we develop an efficient alternating direction method of multipliers (ADMM)-based algorithm and analyze its computational complexity. Extensive experiments demonstrate that the proposed model outperforms state-of-the-art baselines for various missing scenarios. These results reveal the critical synergy between topology-aware low-rank constraint and temporal dynamic modeling for spatio-temporal data imputation.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1349-1363"},"PeriodicalIF":10.4,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}