Pub Date : 2025-11-11 | DOI: 10.1109/TKDE.2025.3631376
He Zhang;Shuang Wang;Long Chen;Xiaoping Li;Qing Gao;Quan Z. Sheng
In the era of Big Data and generative artificial intelligence (AI), discovering the truth about various objects from different sources has become a pressing topic. Existing studies primarily focus on dependent sources with conflicting information, where sources may copy information from each other. However, real-world scenarios are often more complex, with dependence relationships among sources that change dynamically over time, which makes the truth much harder to discover. One of the key challenges centers on measuring the dynamic dependence among sources. To address this challenge, we have developed three models: $Depen_{Simple}$, $Depen_{Complex}$, and $Depen_{Dynamic}$. These models are based on the Hidden Markov Model (HMM) and are designed to handle different types of dependencies, namely simple source dependence, complex source dependence, and dynamic source dependence. Based on the constructed models, we propose a generic framework for discovering the latent truth, which is evaluated with three HMM-based methods. We conduct extensive experiments on three real-world datasets to evaluate the performance of the proposed methods, and the results demonstrate that all three methods achieve higher accuracy than state-of-the-art methods.
Title: "Reliable Truth Discovery for Dynamic and Dependent Sources". IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 1, pp. 546-558.
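The models above build on the Hidden Markov Model (HMM). As background, here is a minimal sketch of the standard HMM forward recursion, with the hidden state standing in for a source's reliability; all probabilities below are illustrative, not from the paper.

```python
import numpy as np

# Toy HMM: hidden state = whether a source is "reliable" at each time step.
pi = np.array([0.6, 0.4])            # P(initial state): [reliable, unreliable]
A = np.array([[0.8, 0.2],            # transitions from "reliable"
              [0.3, 0.7]])           # transitions from "unreliable"
B = np.array([[0.9, 0.1],            # P(claim correct / wrong | reliable)
              [0.4, 0.6]])           # P(claim correct / wrong | unreliable)

def forward(obs):
    """Forward algorithm: likelihood of an observation sequence under the HMM."""
    alpha = pi * B[:, obs[0]]                 # alpha_0(j) = pi_j * b_j(o_0)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # alpha_t(j) = sum_i alpha_{t-1}(i) A_ij * b_j(o_t)
    return alpha.sum()

# Observations: 0 = the source's claim matched the truth, 1 = it conflicted.
print(forward([0, 0, 1]))
```

Inference over such chains (e.g. via forward-backward or EM) is the standard machinery a framework like this would adapt to estimate source reliability over time.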
Multi-hop Knowledge Graph Reasoning (KGR) seeks to identify accurate answers within Knowledge Graphs (KGs) via multi-step reasoning, predominantly utilizing reinforcement learning (RL) to enhance the efficiency of the reasoning process. Unlike traditional Knowledge Graph Embedding (KGE) methods, RL-based approaches offer superior interpretability. However, these methods often underperform due to two critical limitations: (1) their over-reliance on Horn rules for reasoning paths, which restricts their expressive power; and (2) inadequate utilization of reasoning states during the process. To address these issues, we propose a novel RL-based framework, RAR, which shifts focus from individual paths to subgraph structures for more robust predictions. RAR frames the retrieval of reasoning subgraphs from the KG as a Markov Decision Process (MDP) and incorporates a subgraph retriever. To efficiently explore the extensive subgraph space, we integrate multi-agent RL to enhance the retriever’s capabilities. Additionally, RAR features an advanced analyst module that meticulously examines reasoning states. These modules function iteratively: the retriever expands the subgraph, followed by the analyst module’s in-depth analysis. The insights gained are then used to inform subsequent retrieval steps. Ultimately, the predicted scores from both modules are synthesized to produce more precise posterior scores. Experimental results across multiple datasets demonstrate RAR’s efficacy, showcasing a notable improvement over existing state-of-the-art RL-based KGR methods.
Title: "Subgraph-Centric Multi-Agent Reinforcement Learning for Multi-Hop Knowledge Graph Reasoning". Authors: Tao He;Zerui Chen;Lizi Liao;Yixin Cao;Yuanxing Liu;Wei Tang;Xun Mao;Kai Lv;Ming Liu;Bing Qin. Pub Date : 2025-11-11 | DOI: 10.1109/TKDE.2025.3631495. IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 2, pp. 1319-1333.
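RAR frames subgraph retrieval as a Markov Decision Process: the state is the current subgraph and each action adds to it. A toy sketch of that framing follows, with a hand-written greedy scoring heuristic standing in for the paper's learned multi-agent retriever; the graph and scoring rule are purely illustrative.

```python
# Toy sketch: subgraph retrieval as sequential decision making.
# State = current node set; action = add one neighboring edge.
KG = {  # tiny knowledge graph: node -> [(relation, neighbor)]
    "einstein": [("born_in", "ulm"), ("field", "physics")],
    "ulm": [("located_in", "germany")],
    "physics": [("studies", "matter")],
}

def expand(start, budget, score):
    """Grow a subgraph from `start`, one greedy edge-addition per step."""
    nodes, edges = {start}, []
    for _ in range(budget):
        frontier = [(u, r, v) for u in nodes
                    for r, v in KG.get(u, []) if v not in nodes]
        if not frontier:
            break
        best = max(frontier, key=score)   # greedy stand-in for a learned policy
        edges.append(best)
        nodes.add(best[2])
    return nodes, edges

# e.g. prefer location relations when answering a "where was X born" query
nodes, edges = expand("einstein", budget=2,
                      score=lambda e: e[1] in ("born_in", "located_in"))
print(nodes)
```

In the paper's setting the scoring function is learned with multi-agent RL and the analyst module's feedback steers subsequent expansions, rather than a fixed heuristic.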
Pub Date : 2025-11-10 | DOI: 10.1109/TKDE.2025.3631025
Meng Chen;Hongwei Jia;Zechen Li;Weiming Huang;Kai Zhao;Yongshun Gong;Haoran Xu;Hongjun Dai
A recent trend in urban computing involves utilizing multi-modal data for urban region embedding, which can then be leveraged in a variety of downstream urban sensing tasks. Many previous studies rely on multi-graph embedding techniques and follow a two-stage paradigm: first building a $k$-nearest neighbor graph based on fixed region correlations for each view, and then blending multi-view information in a posterior stage to learn region representations. However, in most existing two-stage studies, multi-graph construction and multi-graph representation learning are decoupled, so the relationship between them, which could provide complementary information to each stage, is left unexploited. In this paper, we unify these two stages into one by constructing learnable weighted complete graphs of regions and propose a new one-stage Region Embedding method with Adaptive region correlation Discovery (READ). Specifically, READ comprises three modules: a disentangled region feature learning module that utilizes a city-context Transformer to encode regions' semantic and mobility features; an adaptive weighted multi-graph construction module that builds multiple complete graphs with learnable weights based on the disentangled region features; and a multi-graph representation learning module that yields effective region representations integrating information from multiple graphs. We conduct thorough experiments on three downstream tasks to assess READ. Experimental results demonstrate that READ considerably outperforms state-of-the-art baseline methods in urban region embedding.
Title: "Region Embedding With Adaptive Correlation Discovery for Predicting Urban Socioeconomic Indicators". IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 2, pp. 1280-1291.
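The two-stage baselines contrasted above start from a $k$-nearest-neighbor graph over fixed region correlations. A minimal sketch of that first stage under cosine similarity follows; the feature vectors are toy values for illustration only.

```python
import numpy as np

def knn_graph(features, k):
    """Adjacency matrix of a k-nearest-neighbor graph under cosine similarity."""
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)        # exclude self-loops
    adj = np.zeros_like(sim)
    for i, row in enumerate(sim):
        for j in np.argsort(row)[-k:]:    # keep the top-k most similar regions
            adj[i, j] = 1.0
    return adj

# Four toy "regions" with 3-d feature vectors (illustrative values only)
regions = np.array([[1.0, 0.0, 0.1],
                    [0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.9],
                    [0.1, 0.9, 1.0]])
A = knn_graph(regions, k=1)
print(A)
```

READ's point is precisely that such a graph is fixed before representation learning begins; its learnable weighted complete graphs let the correlations adapt during training instead.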
Pub Date : 2025-11-10 | DOI: 10.1109/TKDE.2025.3631112
Jiaqi Jiang;Rong-Hua Li;Longlong Lin;Yalong Zhang;Yue Zeng;Xiaowei Ye;Guoren Wang
Important communities are densely connected subgraphs containing vertices with high importance values, and they have received wide attention recently. However, existing methods, predominantly based on the $k$-core model, suffer from limitations such as rigid degree constraints and suboptimal density, often failing to capture highly important vertices. To address these limitations, we propose a new community model based on pseudoarboricity that guarantees near-optimal density while preserving important vertices. Further, we introduce the novel problem of Pseudoarboricity-based Skyline Important Community (PSIC) search, which uniquely treats density and importance as independent attributes. To efficiently address PSIC, we first devise a basic algorithm $\mathsf{ClimbStairs}$, which iteratively refines communities by peeling vertices with low importance. To boost efficiency, we develop an advanced algorithm $\mathsf{DivAndCon}$, which employs a recursive divide-and-conquer strategy combined with weight-based and pseudoarboricity-based pruning techniques, significantly reducing the search space. For massive graphs with billions of edges, inspired by a recursive division tree, we develop several parallel algorithms utilizing a thread pool and a synchronization-free mechanism. Finally, we conduct extensive experiments on 10 real-world networks, and the results demonstrate the superiority of our solutions in terms of effectiveness, efficiency, and scalability.
Title: "Pseudoarboricity-Based Skyline Important Community Search in Large Networks". IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 2, pp. 1264-1279.
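Treating density and importance as independent attributes puts PSIC in the classic skyline (Pareto) setting. A minimal sketch of a generic skyline filter over (density, importance) pairs follows; the scores are illustrative, and the paper's algorithms of course avoid this brute-force pairwise check.

```python
def skyline(communities):
    """Keep (density, importance) pairs not dominated by any other pair.
    c is dominated if some other pair is >= in both attributes and differs."""
    def dominated(c, by):
        return by[0] >= c[0] and by[1] >= c[1] and by != c
    return [c for c in communities
            if not any(dominated(c, other) for other in communities)]

# Illustrative (density, importance) scores for candidate communities
cands = [(0.9, 0.2), (0.7, 0.5), (0.4, 0.8), (0.6, 0.4), (0.3, 0.7)]
print(skyline(cands))
```

The surviving pairs are exactly the trade-off frontier: no community in the result can be improved in one attribute without losing in the other.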
Identifying locally dense communities closely connected to a user-initiated query node is crucial for a wide range of applications. Existing approaches either solely depend on rule-based constraints or exclusively utilize deep learning technologies to identify target communities. This raises an important question: can deep learning be integrated with rule-based constraints to elevate the quality of community search? In this paper, we affirmatively address this question by introducing a novel approach called Neural Community Search via Attribute-augmented Conductance, abbreviated as NCSAC. Specifically, NCSAC first proposes a novel concept of attribute-augmented conductance, which blends (internal and external) structural proximity with attribute similarity. Then, NCSAC extracts a coarse candidate community of satisfactory quality using the proposed attribute-augmented conductance. Subsequently, NCSAC frames community search as a graph optimization task, refining the candidate community through reinforcement learning techniques, thereby producing high-quality results. Extensive experiments on six real-world graphs against ten competitors demonstrate the superiority of our solutions in terms of accuracy, efficiency, and scalability. Notably, the proposed solution outperforms state-of-the-art methods, achieving an F1-score improvement ranging from 5.3% to 42.4%.
Title: "NCSAC: Effective Neural Community Search via Attribute-Augmented Conductance". Authors: Longlong Lin;Quanao Li;Miao Qiao;Zeli Wang;Jin Zhao;Rong-Hua Li;Xin Luo;Tao Jia. Pub Date : 2025-11-07 | DOI: 10.1109/TKDE.2025.3630626. IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 2, pp. 1221-1235.
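Attribute-augmented conductance extends classic graph conductance. For reference, a minimal sketch of the plain structural version, the cut size divided by the smaller side's volume, on a toy graph; the attribute-similarity term is the paper's addition and is not shown here.

```python
def conductance(adj, S):
    """Classic conductance: cut(S, rest) / min(vol(S), vol(rest))."""
    n = len(adj)
    cut = sum(adj[u][v] for u in S for v in range(n) if v not in S)
    vol_S = sum(adj[u][v] for u in S for v in range(n))   # sum of degrees in S
    vol_rest = sum(sum(row) for row in adj) - vol_S
    return cut / min(vol_S, vol_rest)

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
print(conductance(adj, {0, 1, 2}))  # low value: {0,1,2} is a good community
```

Low conductance means few edges leave the set relative to its internal volume, which is why it is a natural quality score for candidate communities around a query node.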
As a pivotal variant of multi-label classification, hierarchical text classification (HTC) faces unique challenges due to its intricate taxonomic hierarchy. Recent state-of-the-art approaches improve performance by considering both the global hierarchy covering all labels and the local hierarchy indicating the substructure of sample-specific ground-truth labels. However, they often over-condense hierarchical information into one or a few tokens, which may cause the loss of useful knowledge. Accordingly, we propose a dual classifier model with global and local hierarchies (DCGL). It adopts prompt tuning-based BERT as the backbone, where the global hierarchy is integrated into the soft prompt template; the resulting classifier branch is termed the global pipeline. To mitigate the information loss caused by hierarchy condensation, we introduce a parallel local hierarchy-aware classifier pipeline. This local pipeline acquires label-level classification features through text propagation on the label hierarchy and aligns these features with oracle label representations of the local hierarchy via graph contrastive learning, which serves as a novel strategy for incorporating the local hierarchy. DCGL thereby obtains more granular and targeted features and captures local hierarchy information such as label co-occurrence and local structure. Moreover, since the global and local pipelines capture distinct yet complementary information, we further apply mutual knowledge distillation to bridge the gap between their output logits and facilitate mutual learning. To better control the degree of distillation, we design a dynamic temperature negatively correlated with label confidence. Comprehensive experiments demonstrate that our DCGL outperforms several representative HTC methods.
Title: "Exploring Global and Local Hierarchies: Dual Classifier With Mutual Distillation for Hierarchical Text Classification". Authors: Yubin Li;Zhaojian Cui;Haokai Gao;Jiale Liu;Yuncheng Jiang. Pub Date : 2025-11-06 | DOI: 10.1109/TKDE.2025.3629743. IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 2, pp. 1084-1098.
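The dynamic-temperature idea can be sketched generically: scale one pipeline's logits by a temperature that shrinks as its label confidence grows, so confident predictions yield sharper distillation targets. The linear schedule below is a hypothetical stand-in, not the paper's formula.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_temperature(confidence, t_min=1.0, t_max=4.0):
    # Hypothetical schedule: temperature decreases linearly as the
    # teacher pipeline's label confidence increases.
    return t_max - (t_max - t_min) * confidence

def distill_targets(teacher_logits):
    """Soft targets for the other pipeline, using a confidence-driven T."""
    conf = softmax(teacher_logits).max()       # label confidence
    T = dynamic_temperature(conf)
    return softmax(teacher_logits, T), T

probs, T = distill_targets([3.0, 0.5, 0.2])    # a confident prediction
print(T, probs)                                # low T, sharper targets
```

In mutual distillation, each pipeline would use such soft targets from the other as an auxiliary KL-divergence loss alongside the ground-truth labels.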
Pub Date : 2025-11-06 | DOI: 10.1109/TKDE.2025.3629816
Wenyu Fang;Wei Huang;Yanhua Liu;Jia Liu;Tianrui Li
Multi-party urban flow analysis is a crucial task in smart cities. However, existing methods struggle to balance data privacy protection with spatio-temporal feature capture; how to capture the complete spatio-temporal features of multi-party urban flow data while protecting data privacy is therefore of great importance. To this end, this paper proposes a spatio-temporal federated analysis model for multi-party urban flow mining that effectively protects data privacy while completely capturing spatio-temporal features. First, a multi-party urban flow mining framework based on federated learning is proposed to fully capture the spatio-temporal features of multi-party urban flow data and mine urban flow pattern knowledge under the premise of protecting data privacy. Second, to reduce the communication cost of multi-party urban flow analysis, we propose a lazy aggregation method based on similarity clustering, which improves the communication efficiency between clients and the server. Further, we propose a similarity evaluation criterion for urban flow data based on a step function, which can effectively calculate the similarity between urban flow data. Finally, we compare the proposed model with benchmark methods on Chengdu Didi order data and point-of-interest data to demonstrate its effectiveness, and we visualize and analyze the spatio-temporal features.
Title: "Multi-Party Federated Urban Flow Mining and Analysis Based on Lazy Aggregation". IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 1, pp. 475-488.
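The general idea behind lazy aggregation via similarity clustering, grouping clients whose model updates point the same way so that fewer uploads reach the server, can be sketched with a greedy cosine-similarity clustering; the threshold and clustering rule below are illustrative stand-ins, not the paper's method.

```python
import numpy as np

def cluster_updates(updates, sim_threshold=0.95):
    """Greedily cluster client updates by cosine similarity.
    Only one representative per cluster then needs to reach the server."""
    clusters = []                       # lists of client indices
    reps = []                           # normalized representative per cluster
    for i, u in enumerate(updates):
        u = u / np.linalg.norm(u)
        for c, r in zip(clusters, reps):
            if float(u @ r) >= sim_threshold:
                c.append(i)             # similar enough: join this cluster
                break
        else:
            clusters.append([i])        # otherwise start a new cluster
            reps.append(u)
    return clusters

# Four toy client updates: clients 0,1 move one way, clients 2,3 another
ups = [np.array([1.0, 0.1]), np.array([0.9, 0.12]),
       np.array([-1.0, 0.8]), np.array([-0.95, 0.85])]
print(cluster_updates(ups))  # two clusters -> two uploads instead of four
```

The communication saving is the ratio of clusters to clients per round; the paper's step-function similarity criterion would replace the plain cosine comparison here.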
Pub Date : 2025-11-05 | DOI: 10.1109/TKDE.2025.3621708
Chang Cui;Yongqiang Tang;Yuxun Qu;Wensheng Zhang
Survival analysis is extensively employed to analyze the probability of the event of interest, particularly in the medical field. Most current research treats patients as isolated entities, neglecting the complex associations among them, which leads to underutilization of valuable information. Recently, several studies address this limitation by incorporating patient graph structures. However, these approaches generally overlook two critical issues: 1) the exploration of heterogeneous inter-patient relationships, and 2) flexible and scalable inductive inference for test samples. To overcome these challenges, this study introduces a novel framework, Multiplex Graph Guided Deep Survival Analysis (MGG-Surv). Specifically, we employ multiplex patient graphs to capture comprehensive inter-patient associative information. Furthermore, we propose a teacher-student dual network architecture, where the teacher network encodes multiplex graphs, and the learned graph knowledge is transferred to the student network via a unidirectional connection termed Graph-Guided Distillation. The student network integrates this graph knowledge to predict survival outcomes without requiring the patient graphs. These innovative designs facilitate comprehensive integration of inter-patient relationships while achieving flexible and scalable graph-free inference. Experiments on four datasets, encompassing both single and competing risks, demonstrate the superior performance of our framework.
{"title":"Multiplex Graph Guided Deep Survival Analysis","authors":"Chang Cui;Yongqiang Tang;Yuxun Qu;Wensheng Zhang","doi":"10.1109/TKDE.2025.3621708","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3621708","url":null,"abstract":"Survival analysis is extensively employed to analyze the probability of the event of interest, particularly in the medical field. Most current research treats patients as isolated entities, neglecting the complex associations among them, which leads to underutilization of valuable information. Recently, several studies address this limitation by incorporating patient graph structures. However, these approaches generally overlook two critical issues: 1) the exploration of heterogeneous inter-patient relationships, and 2) flexible and scalable inductive inference for test samples. To overcome these challenges, this study introduces a novel framework, Multiplex Graph Guided Deep Survival Analysis (MGG-Surv). Specifically, we employ multiplex patient graphs to capture comprehensive inter-patient associative information. Furthermore, we propose a teacher-student dual network architecture, where the teacher network encodes multiplex graphs, and the learned graph knowledge is transferred to the student network via a unidirectional connection termed Graph-Guided Distillation. The student network integrates this graph knowledge to predict survival outcomes without requiring the patient graphs. These innovative designs facilitate comprehensive integration of inter-patient relationships while achieving flexible and scalable graph-free inference. 
Experiments on four datasets, encompassing both single and competing risks, demonstrate the superior performance of our framework.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 1","pages":"489-505"},"PeriodicalIF":10.4,"publicationDate":"2025-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While stock prediction has traditionally relied on volume-price and fundamental data to predict the return ratio or price movement trend, sentiment factors derived from social media platforms such as StockTwits offer a complementary and useful source of real-time market information. However, we find that most social media posts, along with the public sentiment they reflect, provide limited value for trading predictions due to their noisy nature. To tackle this, we propose a novel dynamic expert tracing algorithm that filters out non-informative posts and identifies both true and inverse experts whose consistent predictions can serve as valuable trading signals. Our approach achieves significant improvements over existing expert identification methods in stock trend prediction. However, when using binary expert predictions to predict the return ratio, our approach, like all other expert identification methods, faces a common challenge of signal sparsity, with expert signals covering only about 4% of all stock-day combinations in our dataset. To address this challenge, we propose a dual graph attention neural network that effectively propagates expert signals across related stocks, enabling accurate prediction of return ratios and significantly increasing signal coverage. Empirical results show that our propagated expert-based signals not only exhibit strong predictive power independently but also work synergistically with traditional financial features. These combined signals significantly outperform representative baseline models in all quant-related metrics, including predictive accuracy, return metrics, and correlation metrics, resulting in more robust investment strategies. We hope this work inspires further research into leveraging social media data for enhancing quantitative investment strategies.
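The core of expert tracing as described above — keep only users whose directional calls are consistently right (true experts) or consistently wrong (inverse experts, whose calls can be traded against) — can be illustrated with a minimal accuracy-threshold filter. The function name, the post-count minimum, and the 0.7/0.3 thresholds are illustrative assumptions, not the authors' algorithm.

```python
def classify_experts(history, min_posts=5, hi=0.7, lo=0.3):
    """history maps a user to a list of (predicted_up, actual_up) pairs.
    Returns {user: (kind, accuracy)}: 'true' experts are right at least
    `hi` of the time; 'inverse' experts are right at most `lo` of the
    time, so flipping their calls yields a usable signal. Thresholds
    are hypothetical."""
    experts = {}
    for user, recs in history.items():
        if len(recs) < min_posts:  # too short a track record: treat as noise
            continue
        acc = sum(p == a for p, a in recs) / len(recs)
        if acc >= hi:
            experts[user] = ("true", acc)
        elif acc <= lo:
            experts[user] = ("inverse", acc)
    return experts
```

Users near 50% accuracy are dropped entirely, which is consistent with the abstract's observation that most posts carry little predictive value; the sparsity of the surviving signals then motivates the graph-based propagation step.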
{"title":"Unleashing Expert Opinion From Social Media for Stock Prediction","authors":"Wanyun Zhou;Saizhuo Wang;Xiang Li;Yiyan Qi;Jian Guo;Xiaowen Chu","doi":"10.1109/TKDE.2025.3626439","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3626439","url":null,"abstract":"While stock prediction has traditionally relied on volume-price and fundamental data to predict the return ratio or price movement trend, sentiment factors derived from social media platforms such as StockTwits offer a complementary and useful source of real-time market information. However, we find that most social media posts, along with the public sentiment they reflect, provide limited value for trading predictions due to their noisy nature. To tackle this, we propose a novel dynamic expert tracing algorithm that filters out non-informative posts and identifies both true and inverse experts whose consistent predictions can serve as valuable trading signals. Our approach achieves significant improvements over existing expert identification methods in stock trend prediction. However, when using binary expert predictions to predict the return ratio, our approach, like all other expert identification methods, faces a common challenge of signal sparsity, with expert signals covering only about 4% of all stock-day combinations in our dataset. To address this challenge, we propose a dual graph attention neural network that effectively propagates expert signals across related stocks, enabling accurate prediction of return ratios and significantly increasing signal coverage. Empirical results show that our propagated expert-based signals not only exhibit strong predictive power independently but also work synergistically with traditional financial features. These combined signals significantly outperform representative baseline models in all quant-related metrics, including predictive accuracy, return metrics, and correlation metrics, resulting in more robust investment strategies. 
We hope this work inspires further research into leveraging social media data for enhancing quantitative investment strategies.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1380-1394"},"PeriodicalIF":10.4,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weakly-Supervised Anomaly Detection (WSAD) has garnered increasing research interest in recent years, as it enables superior detection performance while demanding only a small fraction of labeled data. However, existing WSAD methods face two major limitations. From the data aspect, they struggle to detect anomalies between normal clusters or collective anomalies because they overlook the multi-distribution nature and complex manifolds of real-world data. From the label aspect, they fall short of detecting unknown anomalies because of label insufficiency and anomaly contamination. To address these issues, we propose MMM, a unified WSAD framework for multi-distributional data. The framework consists of three components: a Multi-distribution data modeler captures latent representations of complex data distributions, followed by a Multiform feature extractor that extracts multiple underlying features from the modeler, highlighting the characteristics of potential anomalies. Finally, a Multi-strategy anomaly score estimator converts these features into anomaly scores, with the aid of a novel training approach with three strategies that maximize the utility of both data and labels. Experimental results showed that MMM achieved superior performance and robustness compared to state-of-the-art WSAD methods, while providing interpretable results that facilitate practical anomaly analysis.
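The weakly-supervised setting above — a small labeled subset guiding scores over mostly unlabeled data — can be illustrated with a deliberately minimal baseline: score each sample by its deviation from the statistics of the weakly labeled normal subset. This z-score-style estimator is a hedged stand-in for MMM's multi-strategy score estimator, not the paper's method, and it notably lacks the multi-distribution modeling the paper argues is necessary.

```python
import numpy as np

def anomaly_scores(X, normal_idx):
    """Score each row of X by its mean per-feature |z-score| relative
    to the weakly labeled normal subset X[normal_idx]. Higher scores
    indicate stronger deviation from the labeled-normal statistics."""
    mu = X[normal_idx].mean(axis=0)
    sd = X[normal_idx].std(axis=0) + 1e-8  # guard against zero variance
    return np.abs((X - mu) / sd).mean(axis=1)
```

A single global mean and standard deviation is exactly what fails when normals form several clusters (a point between two normal clusters can look "average"), which is the multi-distribution limitation MMM is designed to overcome.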
{"title":"MMM: A Unified Weakly-Supervised Anomaly Detection Framework for Multi-Distributional Data","authors":"Xu Tan;Junqi Chen;Jiawei Yang;Jie Chen;Susanto Rahardja","doi":"10.1109/TKDE.2025.3626561","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3626561","url":null,"abstract":"Weakly-Supervised Anomaly Detection (WSAD) has garnered increasing research interest in recent years, as it enables superior detection performance while demanding only a small fraction of labeled data. However, existing WSAD methods face two major limitations. From the data aspect, they struggle to detect anomalies between normal clusters or collective anomalies because they overlook the multi-distribution nature and complex manifolds of real-world data. From the label aspect, they fall short of detecting unknown anomalies because of label insufficiency and anomaly contamination. To address these issues, we propose MMM, a unified WSAD framework for multi-distributional data. The framework consists of three components: a Multi-distribution data modeler captures latent representations of complex data distributions, followed by a Multiform feature extractor that extracts multiple underlying features from the modeler, highlighting the characteristics of potential anomalies. Finally, a Multi-strategy anomaly score estimator converts these features into anomaly scores, with the aid of a novel training approach with three strategies that maximize the utility of both data and labels. 
Experimental results showed that MMM achieved superior performance and robustness compared to state-of-the-art WSAD methods, while providing interpretable results that facilitate practical anomaly analysis.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 1","pages":"442-456"},"PeriodicalIF":10.4,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}