Region Embedding With Adaptive Correlation Discovery for Predicting Urban Socioeconomic Indicators
Pub Date: 2025-11-10 | DOI: 10.1109/TKDE.2025.3631025
Meng Chen;Hongwei Jia;Zechen Li;Weiming Huang;Kai Zhao;Yongshun Gong;Haoran Xu;Hongjun Dai
A recent trend in urban computing involves utilizing multi-modal data for urban region embedding, and the learned embeddings can then be applied to a variety of downstream urban sensing tasks. Many previous studies rely on multi-graph embedding techniques and follow a two-stage paradigm: first building a $k$-nearest neighbor graph based on fixed region correlations for each view, and then blending multi-view information in a posterior stage to learn region representations. However, in most existing two-stage studies, multi-graph construction and multi-graph representation learning are decoupled, so the relationship between them, which could provide complementary information to each stage, is left unexploited. In this paper, we unify these two stages into one by constructing learnable weighted complete graphs of regions and propose a new one-stage Region Embedding method with Adaptive region correlation Discovery (READ). Specifically, READ comprises three modules: a disentangled region feature learning module that utilizes a city-context Transformer to encode regions’ semantic and mobility features, and an adaptive weighted multi-graph construction module that builds multiple complete graphs with learnable weights based on the disentangled region features. In addition, we propose a multi-graph representation learning module to yield effective region representations that integrate information from multiple graphs. We conduct thorough experiments on three downstream tasks to assess READ. Experimental results demonstrate that READ considerably outperforms state-of-the-art baseline methods in urban region embedding.
{"title":"Region Embedding With Adaptive Correlation Discovery for Predicting Urban Socioeconomic Indicators","authors":"Meng Chen;Hongwei Jia;Zechen Li;Weiming Huang;Kai Zhao;Yongshun Gong;Haoran Xu;Hongjun Dai","doi":"10.1109/TKDE.2025.3631025","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3631025","url":null,"abstract":"A recent trend in urban computing involves utilizing multi-modal data for urban region embedding, which can be further expanded in a variety of downstream urban sensing tasks. Many previous studies rely on multi-graph embedding techniques and follow a two-stage paradigm: first building a <inline-formula><tex-math>$k$</tex-math></inline-formula>-nearest neighbor graph based on fixed region correlations for each view, and then blending multi-view information in a posterior stage to learn region representations. However, multi-graph construction and multi-graph representation learning are not associated in most existing two-stage studies, and the relationship between them is not leveraged, which can provide complementary information to each other. In this paper, we unify these two stages into one by constructing learnable weighted complete graphs of regions and propose a new one-stage Region Embedding method with Adaptive region correlation Discovery (READ). Specifically, READ comprises three modules, including a disentangled region feature learning module utilizing a city-context Transformer to encode regions’ semantic and mobility features, and an adaptive weighted multi-graph construction module that builds multiple complete graphs with learnable weights based on disentangled features of regions. In addition, we propose a multi-graph representation learning module to yield effective region representations that integrate information from multiple graphs. We conduct thorough experiments on three downstream tasks to assess READ. Experimental results demonstrate that READ considerably outperforms state-of-the-art baseline methods in urban region embedding.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1280-1291"},"PeriodicalIF":10.4,"publicationDate":"2025-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pseudoarboricity-Based Skyline Important Community Search in Large Networks
Pub Date: 2025-11-10 | DOI: 10.1109/TKDE.2025.3631112
Jiaqi Jiang;Rong-Hua Li;Longlong Lin;Yalong Zhang;Yue Zeng;Xiaowei Ye;Guoren Wang
Important communities are densely connected subgraphs containing vertices with high importance values, and they have received wide attention recently. However, existing methods, predominantly based on the $k$-core model, suffer from limitations such as rigid degree constraints and suboptimal density, often failing to capture highly important vertices. To address these limitations, we propose a new community model based on pseudoarboricity that guarantees near-optimal density while preserving important vertices. Further, we introduce the novel problem of Pseudoarboricity-based Skyline Important Community (PSIC) search, which uniquely treats density and importance as independent attributes. To solve the PSIC problem efficiently, we first devise a basic algorithm $\mathsf{ClimbStairs}$, which iteratively refines communities by peeling vertices with low importance. To boost efficiency, we develop an advanced algorithm $\mathsf{DivAndCon}$, which employs a recursive divide-and-conquer strategy combined with weight-based and pseudoarboricity-based pruning techniques, significantly reducing the search space. For massive graphs with billions of edges, inspired by the recursive division tree, we develop several parallel algorithms utilizing a thread pool and synchronization-free mechanisms. Finally, we conduct extensive experiments on 10 real-world networks, and the results demonstrate the superiority of our solutions in terms of effectiveness, efficiency, and scalability.
{"title":"Pseudoarboricity-Based Skyline Important Community Search in Large Networks","authors":"Jiaqi Jiang;Rong-Hua Li;Longlong Lin;Yalong Zhang;Yue Zeng;Xiaowei Ye;Guoren Wang","doi":"10.1109/TKDE.2025.3631112","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3631112","url":null,"abstract":"Important communities are densely connected subgraphs containing vertices with high importance values, which have received wide attention recently. However, existing methods, predominantly based on the <inline-formula><tex-math>$k$</tex-math></inline-formula>-core model, suffer from limitations such as rigid degree constraints and suboptimal density, often failing to capture highly important vertices. To address these limitations, we propose a new community model based on pseudoarboricity that guarantees near-optimal density while preserving important vertices. Further, we introduce a novel problem of Psudoarboricity-based Skyline Important Community (PSIC), which uniquely treats density and importance as independent attributes. To efficiently address PSIC, we first devise a basic algorithm <inline-formula><tex-math>$mathsf {ClimbStairs}$</tex-math></inline-formula>, which iteratively refines communities by peeling vertices with low importance. To boost efficiency, we develop an advanced algorithm <inline-formula><tex-math>$mathsf {DivAndCon}$</tex-math></inline-formula>, which employs a recursive divide-and-conquer strategy combined with weight-based and pseudoarboricity-based pruning techniques, significantly reducing the search space. For massive graphs with billions of edges, inspired by a recursive division tree, we develop several parallel algorithms utilizing thread-pool and free-synchronization mechanism. Finally, we conduct extensive experiments on 10 real-world networks, and the results demonstrate the superiority of our solutions in terms of effectiveness, efficiency, and scalability.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1264-1279"},"PeriodicalIF":10.4,"publicationDate":"2025-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NCSAC: Effective Neural Community Search via Attribute-Augmented Conductance
Pub Date: 2025-11-07 | DOI: 10.1109/TKDE.2025.3630626
Longlong Lin;Quanao Li;Miao Qiao;Zeli Wang;Jin Zhao;Rong-Hua Li;Xin Luo;Tao Jia
Identifying locally dense communities closely connected to a user-initiated query node is crucial for a wide range of applications. Existing approaches either solely depend on rule-based constraints or exclusively utilize deep learning techniques to identify target communities. This raises an important question: can deep learning be integrated with rule-based constraints to elevate the quality of community search? In this paper, we affirmatively address this question by introducing a novel approach called Neural Community Search via Attribute-augmented Conductance, abbreviated as NCSAC. Specifically, NCSAC first proposes the novel concept of attribute-augmented conductance, which blends internal and external structural proximity with attribute similarity. Then, NCSAC extracts a coarse candidate community of satisfactory quality using the proposed attribute-augmented conductance. Subsequently, NCSAC frames community search as a graph optimization task, refining the candidate community through sophisticated reinforcement learning techniques and thereby producing high-quality results. Extensive experiments on six real-world graphs against ten competitors demonstrate the superiority of our solutions in terms of accuracy, efficiency, and scalability. Notably, the proposed solution outperforms state-of-the-art methods, achieving an F1-score improvement ranging from 5.3% to 42.4%.
{"title":"NCSAC: Effective Neural Community Search via Attribute-Augmented Conductance","authors":"Longlong Lin;Quanao Li;Miao Qiao;Zeli Wang;Jin Zhao;Rong-Hua Li;Xin Luo;Tao Jia","doi":"10.1109/TKDE.2025.3630626","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3630626","url":null,"abstract":"Identifying locally dense communities closely connected to the user-initiated query node is crucial for a wide range of applications. Existing approaches either solely depend on rule-based constraints or exclusively utilize deep learning technologies to identify target communities. Therefore, an important question is proposed: can deep learning be integrated with rule-based constraints to elevate the quality of community search? In this paper, we affirmatively address this question by introducing a novel approach called Neural Community Search via Attribute-augmented Conductance, abbreviated as NCSAC. Specifically, NCSAC first proposes a novel concept of attribute-augmented conductance, which harmoniously blends the (internal and external) structural proximity and the attribute similarity. Then, NCSAC extracts a coarse candidate community of satisfactory quality using the proposed attribute-augmented conductance. Subsequently, NCSAC frames the community search as a graph optimization task, refining the candidate community through sophisticated reinforcement learning techniques, thereby producing high-quality results. Extensive experiments on six real-world graphs and ten competitors demonstrate the superiority of our solutions in terms of accuracy, efficiency, and scalability. Notably, the proposed solution outperforms state-of-the-art methods, achieving an impressive F1-score improvement ranging from 5.3% to 42.4%.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1221-1235"},"PeriodicalIF":10.4,"publicationDate":"2025-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring Global and Local Hierarchies: Dual Classifier With Mutual Distillation for Hierarchical Text Classification
Pub Date: 2025-11-06 | DOI: 10.1109/TKDE.2025.3629743
Yubin Li;Zhaojian Cui;Haokai Gao;Jiale Liu;Yuncheng Jiang
As a pivotal variant of multi-label classification, hierarchical text classification (HTC) faces unique challenges due to its intricate taxonomic hierarchy. Recent state-of-the-art approaches improve performance by considering both the global hierarchy covering all labels and the local hierarchy indicating the substructure of sample-specific ground-truth labels. However, they often over-condense hierarchical information into one or several tokens, which may cause the loss of useful knowledge. Accordingly, we propose a dual classifier model with global and local hierarchies (DCGL). It adopts prompt-tuning-based BERT as the backbone, integrating the global hierarchy into the soft prompt template; the resulting classifier branch is termed the global pipeline. To mitigate the information loss caused by hierarchy condensation, we introduce a parallel local hierarchy-aware classifier pipeline. This local pipeline acquires label-level classification features through text propagation on the label hierarchy and aligns these features with oracle label representations of the local hierarchy via graph contrastive learning, which serves as a novel strategy for incorporating the local hierarchy. DCGL thereby obtains more granular and targeted features and captures local hierarchy information such as label co-occurrence and local structure. Moreover, since the global and local pipelines capture distinct yet complementary information, we further apply mutual knowledge distillation to bridge the gap between their output logits and facilitate mutual learning. To better control the degree of distillation, we design a dynamic temperature negatively correlated with label confidence. Comprehensive experiments demonstrate that DCGL outperforms several representative HTC methods.
{"title":"Exploring Global and Local Hierarchies: Dual Classifier With Mutual Distillation for Hierarchical Text Classification","authors":"Yubin Li;Zhaojian Cui;Haokai Gao;Jiale Liu;Yuncheng Jiang","doi":"10.1109/TKDE.2025.3629743","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3629743","url":null,"abstract":"As a pivotal variant of multi-label classification, hierarchical text classification (HTC) faces unique challenges due to its intricate taxonomic hierarchy. Recent state-of-the-art approaches improve performance by considering both global hierarchy covering all labels and local hierarchy indicating substructure of sample-specific ground-truth labels. However, they often over-condense hierarchical information into one or several tokens, which may cause the loss of useful knowledge. Accordingly, we propose a dual classifier model with global and local hierarchies (DCGL). It adopts prompt tuning-based BERT as the backbone, where global hierarchy is integrated into the soft prompt template. And this resulting classifier branch is termed global pipeline. To mitigate information loss caused by hierarchy condensation, we introduce a parallel local hierarchy-aware classifier pipeline. This local pipeline acquires label-level classification features through text propagation on the label hierarchy and aligns these features with oracle label representations of local hierarchy via graph contrastive learning, which serve as a novel strategy for local hierarchy incorporation. Thereby, DCGL obtains more granular and targeted features and captures local hierarchy information such as label co-occurrence and local structure. Moreover, since global and local pipelines capture distinct yet complementary information, we further apply mutual knowledge distillation to bridge the gap between their output logits and facilitate mutual learning. And to better control the distillation degree, we design a dynamic temperature negatively correlated with label confidence. Comprehensive experiments demonstrate that our DCGL outperforms several representative HTC methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1084-1098"},"PeriodicalIF":10.4,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Party Federated Urban Flow Mining and Analysis Based on Lazy Aggregation
Pub Date: 2025-11-06 | DOI: 10.1109/TKDE.2025.3629816
Wenyu Fang;Wei Huang;Yanhua Liu;Jia Liu;Tianrui Li
Multi-party urban flow analysis is a crucial task in smart cities. However, existing analysis methods have difficulty trading off data privacy against spatio-temporal feature capture, so capturing the complete spatio-temporal features of multi-party urban flow data while protecting data privacy is of great importance. To address both concerns, this paper proposes a spatio-temporal federated analysis model for multi-party urban flow mining that effectively protects data privacy while completely capturing spatio-temporal features. First, a multi-party urban flow mining framework based on federated learning is proposed to fully capture the spatio-temporal feature information of multi-party urban flow data and mine urban flow pattern knowledge under the premise of protecting data privacy. Second, to reduce the communication cost of multi-party urban flow analysis, we propose a lazy aggregation method based on similarity clustering, which improves the communication efficiency between clients and the server. Further, we propose a similarity evaluation criterion for urban flow data based on a step function, which can effectively calculate the similarity between urban flow data. Finally, we compare the proposed model with several benchmark methods on Chengdu Didi order data and point-of-interest data to demonstrate its effectiveness, and we visualize and analyze the spatio-temporal features.
{"title":"Multi-Party Federated Urban Flow Mining and Analysis Based on Lazy Aggregation","authors":"Wenyu Fang;Wei Huang;Yanhua Liu;Jia Liu;Tianrui Li","doi":"10.1109/TKDE.2025.3629816","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3629816","url":null,"abstract":"Multi-party urban flow analysis is a crucial task in smart cities. However, existing analysis methods has difficulty in trade-off between data privacy security and spatio-temporal feature capture. The solution to the problem of how to capture the complete spatio-temporal features of multi-party urban flow data while protecting data privacy is of great importance in multi-party urban flow analysis. Therefore, to address data privacy and spatio-temporal feature capture in multi-party urban flow analysis, this paper proposes a spatio-temporal federated analysis model, for multi-party urban flow mining, which is able to effectively protect data privacy and capture spatio-temporal features completely at the same time. First, a multi-party urban flow mining framework based on federated learning is proposed to realize complete capture of spatio-temporal feature information of multi-party urban flow data and mining urban flow pattern knowledge under the premise of protecting data privacy. Second, to address the communication cost of the multi-party urban flow analysis, we propose a lazy aggregation method based on similarity clustering, which improves the communication efficiency between clients and the server. Further, we propose a similarity evaluation criteria for urban flow data based on step function, which can effectively calculate the similarity between urban flow data. Finally, we compare the proposed model with some benchmark methods on Chengdu Didi order data and point of interest data to prove the effectiveness of the proposed model and visualize and analyze the spatio-temporal features.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 1","pages":"475-488"},"PeriodicalIF":10.4,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiplex Graph Guided Deep Survival Analysis
Pub Date: 2025-11-05 | DOI: 10.1109/TKDE.2025.3621708
Chang Cui;Yongqiang Tang;Yuxun Qu;Wensheng Zhang
Survival analysis is extensively employed to analyze the probability of an event of interest, particularly in the medical field. Most current research treats patients as isolated entities, neglecting the complex associations among them, which leads to underutilization of valuable information. Recently, several studies have addressed this limitation by incorporating patient graph structures. However, these approaches generally overlook two critical issues: 1) the exploration of heterogeneous inter-patient relationships, and 2) flexible and scalable inductive inference for test samples. To overcome these challenges, this study introduces a novel framework, Multiplex Graph Guided Deep Survival Analysis (MGG-Surv). Specifically, we employ multiplex patient graphs to capture comprehensive inter-patient associative information. Furthermore, we propose a teacher-student dual network architecture, where the teacher network encodes the multiplex graphs and the learned graph knowledge is transferred to the student network via a unidirectional connection termed Graph-Guided Distillation. The student network integrates this graph knowledge to predict survival outcomes without requiring the patient graphs. These designs facilitate comprehensive integration of inter-patient relationships while achieving flexible and scalable graph-free inference. Experiments on four datasets, encompassing both single and competing risks, demonstrate the superior performance of our framework.
{"title":"Multiplex Graph Guided Deep Survival Analysis","authors":"Chang Cui;Yongqiang Tang;Yuxun Qu;Wensheng Zhang","doi":"10.1109/TKDE.2025.3621708","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3621708","url":null,"abstract":"Survival analysis is extensively employed to analyze the probability of the event of interest, particularly in the medical field. Most current research treats patients as isolated entities, neglecting the complex associations among them, which leads to underutilization of valuable information. Recently, several studies address this limitation by incorporating patient graph structures. However, these approaches generally overlook two critical issues: 1) the exploration of heterogeneous inter-patient relationships, and 2) flexible and scalable inductive inference for test samples. To overcome these challenges, this study introduces a novel framework, Multiplex Graph Guided Deep Survival Analysis (MGG-Surv). Specifically, we employ multiplex patient graphs to capture comprehensive inter-patient associative information. Furthermore, we propose a teacher-student dual network architecture, where the teacher network encodes multiplex graphs, and the learned graph knowledge is transferred to the student network via a unidirectional connection termed Graph-Guided Distillation. The student network integrates this graph knowledge to predict survival outcomes without requiring the patient graphs. These innovative designs facilitate comprehensive integration of inter-patient relationships while achieving flexible and scalable graph-free inference. Experiments on four datasets, encompassing both single and competing risks, demonstrate the superior performance of our framework.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 1","pages":"489-505"},"PeriodicalIF":10.4,"publicationDate":"2025-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unleashing Expert Opinion From Social Media for Stock Prediction
Pub Date: 2025-10-28 | DOI: 10.1109/TKDE.2025.3626439
Wanyun Zhou;Saizhuo Wang;Xiang Li;Yiyan Qi;Jian Guo;Xiaowen Chu
While the stock prediction task traditionally relies on volume-price and fundamental data to predict the return ratio or price movement trend, sentiment factors derived from social media platforms such as StockTwits offer a complementary and useful source of real-time market information. However, we find that most social media posts, along with the public sentiment they reflect, provide limited value for trading predictions due to their noisy nature. To tackle this, we propose a novel dynamic expert tracing algorithm that filters out non-informative posts and identifies both true and inverse experts whose consistent predictions can serve as valuable trading signals. Our approach achieves significant improvements over existing expert identification methods in stock trend prediction. However, when using binary expert predictions to predict the return ratio, our approach, like all other expert identification methods, faces a common challenge of signal sparsity: expert signals cover only about 4% of all stock-day combinations in our dataset. To address this challenge, we propose a dual graph attention neural network that effectively propagates expert signals across related stocks, enabling accurate prediction of return ratios and significantly increasing signal coverage. Empirical results show that our propagated expert-based signals not only exhibit strong predictive power independently but also work synergistically with traditional financial features. The combined signals significantly outperform representative baseline models on all quant-related metrics, including predictive accuracy, return metrics, and correlation metrics, resulting in more robust investment strategies. We hope this work inspires further research into leveraging social media data to enhance quantitative investment strategies.
{"title":"Unleashing Expert Opinion From Social Media for Stock Prediction","authors":"Wanyun Zhou;Saizhuo Wang;Xiang Li;Yiyan Qi;Jian Guo;Xiaowen Chu","doi":"10.1109/TKDE.2025.3626439","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3626439","url":null,"abstract":"While stock prediction task traditionally relies on volume-price and fundamental data to predict the return ratio or price movement trend, sentiment factors derived from social media platforms such as StockTwits offer a complementary and useful source of real-time market information. However, we find that most social media posts, along with the public sentiment they reflect, provide limited value for trading predictions due to their noisy nature. To tackle this, we propose a novel dynamic expert tracing algorithm that filters out non-informative posts and identifies both true and inverse experts whose consistent predictions can serve as valuable trading signals. Our approach achieves significant improvements over existing expert identification methods in stock trend prediction. However, when using binary expert predictions to predict the return ratio, similar to all other expert identification methods, our approach faces a common challenge of signal sparsity with expert signals cover only about 4% of all stock-day combinations in our dataset. To address this challenge, we propose a dual graph attention neural network that effectively propagates expert signals across related stocks, enabling accurate prediction of return ratios and significantly increasing signal coverage. Empirical results show that our propagated expert-based signals not only exhibit strong predictive power independently but also work synergistically with traditional financial features. These combined signals significantly outperform representative baseline models in all quant-related metrics including predictive accuracy, return metrics, and correlation metrics, resulting in more robust investment strategies. We hope this work inspires further research into leveraging social media data for enhancing quantitative investment strategies.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 2","pages":"1380-1394"},"PeriodicalIF":10.4,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145898200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MMM: A Unified Weakly-Supervised Anomaly Detection Framework for Multi-Distributional Data
Pub Date: 2025-10-28 | DOI: 10.1109/TKDE.2025.3626561
Xu Tan;Junqi Chen;Jiawei Yang;Jie Chen;Susanto Rahardja
Weakly-Supervised Anomaly Detection (WSAD) has garnered increasing research interest in recent years, as it enables superior detection performance while demanding only a small fraction of labeled data. However, existing WSAD methods face two major limitations. On the data side, they struggle to detect anomalies lying between normal clusters, as well as collective anomalies, because they overlook the multi-distributional nature and complex manifolds of real-world data. On the label side, they fall short of detecting unknown anomalies because of label insufficiency and anomaly contamination. To address these issues, we propose MMM, a unified WSAD framework for multi-distributional data. The framework consists of three components: a Multi-distribution data modeler that captures latent representations of complex data distributions, a Multiform feature extractor that extracts multiple underlying features from the modeler, highlighting the characteristics of potential anomalies, and a Multi-strategy anomaly score estimator that converts these features into anomaly scores, aided by a novel training approach with three strategies that maximize the utility of both data and labels. Experimental results showed that MMM achieved superior performance and robustness compared to state-of-the-art WSAD methods, while providing interpretable results that facilitate practical anomaly analysis.
{"title":"MMM: A Unified Weakly-Supervised Anomaly Detection Framework for Multi-Distributional Data","authors":"Xu Tan;Junqi Chen;Jiawei Yang;Jie Chen;Susanto Rahardja","doi":"10.1109/TKDE.2025.3626561","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3626561","url":null,"abstract":"Weakly-Supervised Anomaly Detection (WSAD) has garnered increasing research interest in recent years, as it enables superior detection performance while demanding only a small fraction of labeled data. However, existing WSAD methods face two major limitations. From the data aspect, they struggle to detect anomalies between normal clusters or collective anomalies due to overlooking the multi-distribution and complex manifolds of real-world data. From the label aspect, they fall short of detecting unknown anomalies because of the label-insufficiency and anomaly contamination. To address these issues, we propose MMM, a unified WSAD framework for multi-distributional data. The framework consists of three components: a Multi-distribution data modeler captures latent representations of complex data distributions, followed by a Multiform feature extractor that extracts multiple underlying features from the modeler, highlighting the characteristics of potential anomalies. Finally, a Multi-strategy anomaly score estimator converts these features into anomaly scores, with the aid of a novel training approach with three strategies that maximize the utility of both data and labels. Experimental results showed that MMM achieved superior performance and robustness compared to state-of-the-art WSAD methods, while providing interpretable results that facilitate practical anomaly analysis.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 1","pages":"442-456"},"PeriodicalIF":10.4,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse Canonical Correlation Analysis With Preserved Sparsity
Pub Date: 2025-10-27 | DOI: 10.1109/TKDE.2025.3625818
Abd-Krim Seghouane;M. Ali Qadar;Inge Koch;Aref Miri Rekavandi
Canonical correlation analysis (CCA) is a widely used multivariate analysis technique for explaining the relation between two sets of variables. It achieves this goal by finding linear combinations of the variables with maximal correlation. Recently, under the assumption that the leading canonical directions are sparse, various penalized CCA procedures have been proposed for high-dimensional data applications. However, all these procedures have the drawback of not preserving the sparsity pattern across the retained leading canonical directions. To address this issue, two new sparse CCA methods are proposed in this paper. The first method is obtained by diagonal thresholding of two square matrices derived from the cross-covariance matrix of the two sets of variables, where each matrix characterizes one set of variables. A model selection criterion is used to select the number of variables to retain from each matrix diagonal. The second method is derived within an adaptive alternating penalized least squares framework, where the $\ell_{2}^{1}$-norm is used as a penalty promoting block sparsity. Compared to existing sparse CCA methods, the proposed methods have the advantage of preserving the sparsity pattern across the retained canonical loading vectors. Their performance is illustrated in an extended experimental study, which shows the superiority of the proposed methods.
{"title":"Sparse Canonical Correlation Analysis With Preserved Sparsity","authors":"Abd-Krim Seghouane;M. Ali Qadar;Inge Koch;Aref Miri Rekavandi","doi":"10.1109/TKDE.2025.3625818","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3625818","url":null,"abstract":"Canonical correlation analysis (CCA) is a widely used multivariate analysis technique for explaining the relation between two sets of variables. It achieves this goal by finding linear combinations of the variables with maximal correlation. Recently, under the assumption that leading canonical directions are sparse, various penalized CCA procedures have been proposed for high dimensional data applications. However, all these procedures have the inconvenience of not preserving the sparsity among the retained leading canonical directions. To address this issue, two new sparse CCA methods are proposed in this paper. The first method is obtained by diagonal thresholding of two square matrices derived from the cross-covariance matrix of the two sets of variables where each matrix characterizes one set of variables. A model selection criterion is used to select the number of variables to retain from each matrix diagonal. The second method is derived within an adaptive alternating penalized least squares framework where the <inline-formula><tex-math>$ell _{2}^{1}$</tex-math></inline-formula>-norm is used as a penalty promoting block sparsity. Compared to existing sparse CCA methods, the proposed methods have the advantage of preserving the sparsity across the retained canonical loading vectors. Their performance are illustrated in an extended experimental study which shows the superior performance of the proposed methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 1","pages":"616-630"},"PeriodicalIF":10.4,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intent-Based Trust Evaluation
Pub Date: 2025-10-23 | DOI: 10.1109/TKDE.2025.3624874
Rongwei Xu;Guanfeng Liu;Yan Wang;Xuyun Zhang;Kai Zheng;Xiaofang Zhou
Trust relationships play a crucial role in various domains, such as social spam detection, retweet behavior analytics, and recommendation systems. Trust is often implicit and difficult to observe directly in the real world, as it is driven by people’s underlying intentions and motivations. Therefore, when evaluating trust, it is critical to analyze not only user behavior data but also the intentions behind the behaviors that lead to trust. Existing trust evaluation methods often neglect the underlying reasons behind connections, such as shared hobbies or belonging to the same community, and therefore cannot differentiate the genuine intentions that lead to trust, resulting in inaccurate evaluation of hidden trust relationships. To address this issue, we propose a novel Intent-based model for Trust Evaluation (INTRUST). The model uses hypergraphs to distinguish the intents behind high-order information in social communities. First, we use hyperedges to represent high-order correlations between user-to-item and user-to-user interactions. Then, we construct $K$ intent prototypes, which serve as foundational elements of trust, and disentangle $K$ independent intent subgraphs from these high-order correlations. To enhance the generalization and robustness of the model, we employ self-supervised learning and construct contrastive views at the node level, hyperedge level, and node-hyperedge level. Extensive experiments on real-world datasets demonstrate that our model outperforms state-of-the-art approaches in terms of trust evaluation accuracy and efficiency.
{"title":"Intent-Based Trust Evaluation","authors":"Rongwei Xu;Guanfeng Liu;Yan Wang;Xuyun Zhang;Kai Zheng;Xiaofang Zhou","doi":"10.1109/TKDE.2025.3624874","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3624874","url":null,"abstract":"Trust relationships play a crucial role in various domains, such as social spam detection, retweet behavior analytics, and recommendation systems. Trust is often implicit and difficult to observe directly in the real world, as it is driven by people’s underlying intentions and motivations. Therefore, when evaluating trust, it is critical to analyze not only user behavior data but also the intentions behind these behaviors that lead to trust. Existing trust evaluation methods often neglect the underlying reasons behind connections, such as shared hobbies or belonging to the same community. Therefore, these methods cannot differentiate the genuine intentions that lead to trust, resulting in an inaccurate evaluation of hidden trust relationships. To address this issue, we propose a novel Intent-based model for Trust Evaluation (INTRUST). This model can distinguish the intent behind high-order information in social communities using hypergraphs. Initially, we used hyperedges to represent high-order correlations between user-to-item and user-to-user interactions. Then, we construct <inline-formula><tex-math>$K$</tex-math></inline-formula> intent prototypes, which serve as foundational elements to build trust. Furthermore, we distinguish <inline-formula><tex-math>$K$</tex-math></inline-formula>-independent intent subgraphs from these high-order correlations. To enhance the generalization and robustness of the model, we employ self-supervised learning and construct contrastive views at the node-level, hyperedge-level, and node-hyperedge-level. Extensive experiments on real-world datasets demonstrate that our model outperforms state-of-the-art approaches in terms of trust evaluation accuracy and efficiency.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 1","pages":"399-413"},"PeriodicalIF":10.4,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}