Pub Date: 2025-09-11. DOI: 10.1109/TKDE.2025.3608723
Zidong Wang;Xiaoguang Gao;Qingfu Zhang
Learning graphical causal models from observational data can effectively elucidate the underlying causal mechanisms behind variables. In the context of limited datasets, modelers often incorporate prior knowledge, assumed to be correct, as a penalty in single-objective optimization. However, this approach struggles to accommodate complex and uncertain priors. This paper introduces UpCM, which tackles the issue from a multi-objective optimization perspective. Instead of focusing exclusively on the DAG as the optimization goal, UpCM methodically evaluates the effect of uncertain priors on specific structures, merging data-driven and knowledge-driven objectives. Utilizing the MOEA/D framework, it achieves a balanced trade-off between these objectives. Furthermore, since uncertain priors may introduce erroneous constraints, resulting in PDAGs lacking consistent extensions, the minimal non-consistent extension is explored. This extension, which separately incorporates positive and negative constraints, aims to approximate the true causality of the PDAGs. Experimental results demonstrate that UpCM achieves significant structural accuracy improvements over baseline methods: it reduces the SHD by 7.94%, 13.23%, and 12.8% relative to PC_stable, GES, and MAHC, respectively, when incorporating uncertain priors. In downstream inference tasks, UpCM outperforms domain-expert knowledge graphs, owing to its ability to learn explainable causal relationships that balance data-driven evidence with prior knowledge.
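The abstract does not detail the optimization machinery, but the MOEA/D-style decomposition it relies on can be sketched as background. In this hypothetical toy (not UpCM itself), each candidate structure carries a data-driven objective and a prior-agreement objective, and a weight vector scalarizes the pair via the standard Tchebycheff aggregation:

```python
# Illustrative sketch of MOEA/D-style scalarization over two objectives:
# a data-fit score and a prior-agreement score (lower is better for both).
# All names and numbers are hypothetical, for illustration only.

def tchebycheff(objectives, weights, ideal):
    """Scalarize a multi-objective point: max_i w_i * |f_i - z*_i|."""
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, ideal))

def best_candidate(candidates, weights, ideal):
    """Pick the candidate whose (data, prior) objective pair minimizes
    the Tchebycheff aggregation for one weight vector."""
    return min(candidates, key=lambda c: tchebycheff(c["objs"], weights, ideal))

candidates = [
    {"name": "G1", "objs": (0.40, 0.10)},  # matches priors, worse data fit
    {"name": "G2", "objs": (0.15, 0.50)},  # fits data, violates priors
    {"name": "G3", "objs": (0.25, 0.20)},  # balanced trade-off
]
ideal = (0.0, 0.0)  # ideal point: perfect fit on both objectives
```

With equal weights (0.5, 0.5), the balanced candidate G3 wins the scalarized comparison, which is the intuition behind trading off data against uncertain priors rather than penalizing one objective inside the other.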
Title: Uncertain Priors for Graphical Causal Models: A Multi-Objective Optimization Perspective
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7426-7439.
Frequent object mining has gained considerable interest in the research community and can be split into frequent item mining and frequent set mining, depending on the type of object. While existing sketch-based algorithms have made significant progress in addressing these two tasks concurrently, they also have notable limitations: they either support only software platforms with low throughput or compromise accuracy for faster processing speed and better hardware compatibility. In this paper, we make a substantial stride towards supporting frequent object mining by designing SandwichSketch, which draws inspiration from sandwich making and introduces two techniques, double fidelity enhancement and hierarchical hot locking, to guarantee high fidelity on both tasks. We implement SandwichSketch on three platforms (CPU, Redis, and FPGA) and show that it enhances accuracy by 38.4× and 5× for the two tasks on three real-world datasets, respectively. Additionally, it supports a distributed measurement scenario with less than a 0.01% decrease in Average Relative Error (ARE) when the number of nodes increases from 1 to 16.
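For readers unfamiliar with sketch-based frequent item mining, the classic count-min sketch illustrates the family of data structures this line of work builds on. This is standard background, not SandwichSketch's own design:

```python
# Count-min sketch: a classic sub-linear-memory frequency estimator for
# data streams. Standard background, not SandwichSketch's structure.
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-ish hash per row, derived from a salted digest.
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Takes the minimum over rows; estimates never undercount,
        # which is the count-min one-sided error guarantee.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

Hardware-friendly sketches like the one the paper proposes trade off exactly the kind of hash/counter layout shown here against throughput and accuracy.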
Title: SandwichSketch: A More Accurate Sketch for Frequent Object Mining in Data Streams
Authors: Zhuochen Fan;Ruixin Wang;Zihan Jiang;Ruwen Zhang;Tong Yang;Sha Wang;Yuhan Wu;Ruijie Miao;Kaicheng Yang;Bui Cui
Pub Date: 2025-09-09. DOI: 10.1109/TKDE.2025.3607691
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 11, pp. 6636-6650.
Knob tuning aims to optimize database performance by searching for the most effective knob configuration under a given workload. Existing works suffer from two significant problems. First, knob tuning incurs many useless evaluations, even with diverse search methods, because knobs differ in their sensitivity under a given workload. Second, a single evaluation of a knob configuration may lead to overestimation or underestimation because of query performance uncertainty. To solve these problems, we propose a query uncertainty-aware knob classifier, called KnobCF, to enhance knob tuning. Our method has three contributions: (1) we propose uncertainty-aware configuration estimation to improve the tuning process; (2) we design a few-shot uncertainty estimator that requires no extra data collection, ensuring high efficiency in practical tasks; (3) we provide a flexible framework that can be integrated into existing knob tuners and DBMSs without modification. Our experiments on four open-source benchmarks demonstrate that our method effectively reduces useless evaluations and improves tuning results. On TPCC in particular, our method achieves competitive tuning results with only 60% to 70% of the time consumed by full workload evaluations.
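The general idea of filtering useless evaluations under performance uncertainty can be sketched with a simple confidence-bound rule. This is a hypothetical illustration (the decision rule, the name, and the parameter `k` are all assumptions, not KnobCF's actual estimator):

```python
# Toy uncertainty-aware evaluation filter: run a few cheap probe
# measurements of a knob configuration; skip the expensive full
# evaluation when even the optimistic bound cannot beat the incumbent.
# Hypothetical sketch, not the paper's classifier.
from statistics import mean, stdev

def worth_evaluating(probe_latencies, best_so_far, k=2.0):
    """Return True if the config's optimistic latency bound
    (mean - k * std over probe runs) could still beat best_so_far."""
    mu = mean(probe_latencies)
    sigma = stdev(probe_latencies) if len(probe_latencies) > 1 else 0.0
    return (mu - k * sigma) < best_so_far
```

A tuner wrapped with such a filter spends its full-workload evaluations only on configurations whose uncertainty interval still overlaps the incumbent, which is the intuition behind the reported 60%-70% time savings.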
Title: KnobCF: Uncertainty-Aware Knob Tuning
Authors: Yu Yan;Junfang Huang;Hongzhi Wang;Jian Geng;Kaixin Zhang;Tao Yu
Pub Date: 2025-09-09. DOI: 10.1109/TKDE.2025.3608030
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7240-7254.
Performing complex First-Order Logic (FOL) queries on knowledge graphs is crucial for advancing knowledge reasoning. Knowledge graphs encapsulate rich semantic interactions among entities, encompassing both explicit structural knowledge represented by triples $(e_{1}, r, e_{2})$ and implicit relational knowledge through multi-hop paths $(e_{1} \stackrel{r_{1}}{\rightarrow} \cdots e_{3} \cdots \stackrel{r_{2}}{\rightarrow} e_{2})$. Traditional models often focus solely on either triple-level or path-level knowledge, overlooking the benefits of integrating both to enhance logical query answering. This oversight leads to suboptimal representation learning and inefficient query reasoning. To overcome these challenges, we introduce a new Semantic-Aware representation learning model for Query-answering Embeddings (SAQE). Specifically, SAQE employs a joint learning approach that integrates triple-level and path-level knowledge semantics and captures both explicit and implicit contextual nuances within the knowledge graph, yielding more accurate and contextually relevant representations. To efficiently handle the large combinatorial search spaces in FOL reasoning, we propose a novel hierarchical reasoning optimization strategy based on a multi-hop tree, optimizing subqueries rooted at variable nodes in a divide-and-conquer manner. Theoretical analysis confirms that SAQE effectively supports various types of FOL reasoning and enhances generalization for query answering. Extensive experiments demonstrate that our model achieves state-of-the-art performance across several established datasets.
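As background, the projection-and-intersection primitive underlying conjunctive FOL queries over a knowledge graph can be written symbolically. This is the classical set-based view that embedding methods like SAQE approximate in vector space, not SAQE's own reasoning procedure:

```python
# Symbolic answering of a conjunctive FOL query over a toy KG:
# existential projection along relation paths, then intersection (AND).
# Background primitive only; the paper reasons in embedding space.

def project(entities, relation, kg):
    """Existential projection: all tails reachable via `relation`."""
    return {t for h, r, t in kg if r == relation and h in entities}

def answer_conjunction(anchors_and_paths, kg):
    """Intersect the answer sets of several relation paths (logical AND)."""
    results = []
    for anchor, path in anchors_and_paths:
        frontier = {anchor}
        for rel in path:
            frontier = project(frontier, rel, kg)
        results.append(frontier)
    out = results[0]
    for s in results[1:]:
        out &= s
    return out

# Toy KG with a two-hop path a -r1-> b -r2-> {c, e} and an edge d -r3-> c.
toy_kg = {("a", "r1", "b"), ("b", "r2", "c"), ("b", "r2", "e"), ("d", "r3", "c")}
```

The combinatorial blow-up of `frontier` across hops is exactly the search-space problem the paper's multi-hop-tree, divide-and-conquer strategy targets.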
Title: SAQE: Complex Logical Query Answering via Semantic-Aware Representation Learning
Authors: Zongsheng Cao;Qianqian Xu;Zhiyong Yang;Yuan He;Xiaochun Cao;Qingming Huang
Pub Date: 2025-09-05. DOI: 10.1109/TKDE.2025.3603877
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 11, pp. 6651-6665.
Pub Date: 2025-09-03. DOI: 10.1109/TKDE.2025.3605594
Yu Feng;Weixuan Liang;Xinhang Wan;Jiyuan Liu;Miaomiao Li;Xinwang Liu
Multi-view clustering (MVC) has demonstrated impressive performance due to its ability to capture both consistency and diversity information among views. However, most existing techniques assume that all views are available in advance, making them inadequate for stream-view data, such as intelligent transportation systems and medical imaging analysis, where memory constraints or privacy concerns prevent storing all previous views. Although some methods attempt to address this issue by capturing consistency information, they often fail to effectively extract both diversity information and cross-view relationships. We argue that these limitations are inherent to incremental multi-view clustering (IMVC), as the inability to retain all previous views inevitably leads to insufficient information utilization, thereby compromising performance. To address these challenges, we propose a novel algorithm, termed Incremental Multi-View Clustering with Cross-View Correlation and Diversity (CDIMVC). Unlike existing methods that only retain consistency information, CDIMVC also preserves diversity information and utilizes similarity matrices to capture cross-view relationships. To implement this method, we develop three key modules: the dynamic view correlation analysis module (DVCAM), the knowledge extraction module (KEM), and the knowledge transfer module (KTM). When a new view arrives, DVCAM first assesses its importance and correlations to historical views. Subsequently, KEM computes its consistency and diversity information by comparing it to that in the knowledge base. Finally, KTM facilitates the effective transmission of past knowledge, preventing the loss of historical information. By integrating these modules, CDIMVC can effectively capture cross-view relationships and diversity information, facilitating efficient knowledge updating and maintenance. An alternating optimization procedure is also designed to solve the resulting problem.
Experimental results show that CDIMVC exceeds state-of-the-art methods, demonstrating its effectiveness in handling stream-view data.
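The core constraint above, updating shared knowledge from an arriving view without storing past views, can be sketched with a simple weighted similarity-matrix blend. This is a minimal toy of the general incremental idea, not CDIMVC's actual modules:

```python
# Toy incremental update: fold an arriving view's sample-similarity
# matrix into a running consensus, so past views need not be stored.
# Simplified sketch; CDIMVC's DVCAM/KEM/KTM modules are far richer.

def update_consensus(consensus, new_sim, weight):
    """Blend a new view's n x n similarity matrix into the consensus
    with a per-view weight (e.g., from a view-importance score)."""
    n = len(new_sim)
    return [[(1 - weight) * consensus[i][j] + weight * new_sim[i][j]
             for j in range(n)] for i in range(n)]
```

A view-correlation module like DVCAM would, in this picture, set `weight` per arriving view instead of using a fixed constant.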
Title: Incremental Multi-View Clustering: Exploring Stream-View Correlations to Learn Consistency and Diversity
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 12, pp. 7226-7239.
Pub Date: 2025-09-02. DOI: 10.1109/TKDE.2025.3605389
Peipei Li;Shiying Yu;Jiajun Li;Xuegang Hu
Real-world applications have produced massive short text streams. Unlike traditional texts, these streams are characterized by short length, few labeled examples, high velocity, high volume, and dynamic data distributions, which aggravate the issues of data sparseness, missing labels, and concept drift. This poses a huge challenge for existing short text (stream) classification algorithms, which typically assume all short texts are completely labeled and pay little attention to the concept drift hidden in short text streams. Therefore, we propose a novel semi-supervised short text stream classification method based on a drift-aware incremental deep learning ensemble model. Specifically, with a sliding window mechanism, we first fuse three types of information (statistical, semantic, and structural) to address the data sparseness issue. Second, a semi-supervised incremental deep learning ensemble model based on a GCN and a refined LSTM is developed to adapt to high-volume, high-velocity short text streams with missing labels. Third, a concept drift detector based on label-probability distributions is introduced to identify concept drifts. Finally, extensive experiments against eleven well-known classification methods demonstrate the effectiveness of the proposed method in handling short text streams with limited labeled data.
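The idea of detecting drift from label-probability distributions can be illustrated with a minimal windowed comparison. This toy uses a total-variation threshold as the distance; the actual detector in the paper may use a different statistic, so treat every name and the threshold as assumptions:

```python
# Toy drift detector: compare the label distributions of two sliding
# windows and flag drift when their total-variation distance exceeds
# a threshold. Illustrative only; not the paper's detector.
from collections import Counter

def label_distribution(labels):
    total = len(labels)
    return {lbl: c / total for lbl, c in Counter(labels).items()}

def drift_detected(window_a, window_b, threshold=0.2):
    """True when the label distributions of the two windows diverge."""
    p, q = label_distribution(window_a), label_distribution(window_b)
    tv = 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0))
                   for l in set(p) | set(q))
    return tv > threshold
```

In a streaming classifier, a positive detection would trigger retraining or reweighting of the ensemble members fitted on the older window.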
Title: Semi-Supervised Short Text Stream Classification Based on Drift-Aware Incremental Deep Learning
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 11, pp. 6680-6693.
Pub Date: 2025-08-21. DOI: 10.1109/TKDE.2025.3601198
Jiazheng Tian;Kun Xie;Xin Wang;Jigang Wen;Gaogang Xie;Wei Liang;Dafang Zhang;Kenli Li
Sparse data gathering has become a promising solution for reducing measurement costs by leveraging the inherent sparsity of data. However, most existing approaches rely on low-dimensional models such as compressive sensing or matrix completion, which are limited in capturing complex high-dimensional structures. To overcome these limitations, we propose TensorMon, a novel tensor-based sparse data gathering framework. Unlike traditional entry-based or tube-based sampling, TensorMon introduces the concept of cuboid sampling to more effectively exploit multidimensional correlations. We further develop a lightweight sampling scheduling algorithm and a non-iterative inference algorithm to ensure efficient measurement planning and accurate reconstruction of unmeasured data. Theoretical analysis establishes a new performance bound for our sampling strategy, which is significantly lower than those in the existing literature. To validate our theoretical findings, we conduct extensive experiments on four real-world datasets: two network monitoring datasets, a city-scale crowd flow dataset, and a road traffic speed dataset. Experimental results demonstrate that TensorMon achieves substantial reductions in measurement cost, delivers high inference accuracy, and ensures rapid data recovery, highlighting its effectiveness and practicality across diverse application scenarios.
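The contrast between cuboid sampling and entry- or tube-based sampling comes down to which index sets get measured. A minimal sketch of enumerating one sampled cuboid inside a 3-D tensor (illustrative geometry only, not TensorMon's scheduling algorithm):

```python
# Enumerate the index tuples of one contiguous sub-cuboid of a 3-D
# tensor: the unit of measurement in cuboid sampling, as opposed to
# scattered single entries or 1-D tubes. Illustrative sketch only.

def cuboid_indices(shape, origin, size):
    """All (i, j, k) inside the cuboid at `origin` with edge lengths
    `size`, clipped to the tensor's `shape`."""
    return [(i, j, k)
            for i in range(origin[0], min(origin[0] + size[0], shape[0]))
            for j in range(origin[1], min(origin[1] + size[1], shape[1]))
            for k in range(origin[2], min(origin[2] + size[2], shape[2]))]
```

Measuring a contiguous block like this captures correlations along all three modes at once, which is what a scattered-entry sample of the same budget cannot do.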
Title: TensorMon: A Breakthrough in Sparse Data Gathering Leveraging Tensor-Enhanced Techniques for System and Network Monitoring
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 11, pp. 6708-6722.
Pub Date: 2025-08-19. DOI: 10.1109/TKDE.2025.3600103
Chengyi Liu;Jiahao Zhang;Shijie Wang;Wenqi Fan;Qing Li
With the prevalence of social networks on online platforms, social recommendation has become a vital technique for enhancing personalized recommendations. The effectiveness of social recommendations largely relies on the social homophily assumption, which presumes that individuals with social connections often share similar preferences. However, this foundational premise has been recently challenged due to the inherent complexity and noise present in real-world social networks. In this paper, we tackle the low social homophily challenge from an innovative generative perspective, directly generating optimal user social representations that maximize consistency with collaborative signals. Specifically, we propose the Score-based Generative Model for Social Recommendation (SGSR), which effectively adapts the Stochastic Differential Equation (SDE)-based diffusion models for social recommendations. To better fit the recommendation context, SGSR employs a joint curriculum training strategy to mitigate challenges related to missing supervision signals and leverages self-supervised learning techniques to align knowledge across social and collaborative domains. Extensive experiments on real-world datasets demonstrate the effectiveness of our approach in filtering redundant social information and improving recommendation performance.
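For readers new to score-based diffusion, the closed-form VP-SDE perturbation kernel, the forward-noising marginal such models are trained against, is standard background. The `beta_min`/`beta_max` defaults below are the conventional ones from the score-SDE literature, not SGSR's specific noise schedule:

```python
# Closed-form mean/std of the VP-SDE marginal p_t(x | x_0), used by
# score-based generative models to noise a clean representation x_0.
# Standard diffusion background; hedged defaults, not SGSR's schedule.
import math

def perturbation_kernel(x0, t, beta_min=0.1, beta_max=20.0):
    """Return (mean, std) of p_t(x | x_0) for t in [0, 1] under a
    linear beta schedule: log-scale = -t^2 (b_max - b_min)/4 - t b_min/2."""
    log_coef = -0.25 * t ** 2 * (beta_max - beta_min) - 0.5 * t * beta_min
    mean_scale = math.exp(log_coef)
    std = math.sqrt(max(1.0 - mean_scale ** 2, 0.0))
    return [mean_scale * x for x in x0], std
```

At t = 0 the kernel returns the clean representation unchanged; near t = 1 the signal is almost fully replaced by unit-variance noise, and a learned score network runs the SDE in reverse to generate the denoised social representation.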
Title: Score-Based Generative Diffusion Models for Social Recommendations
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 11, pp. 6666-6679.
Pub Date: 2025-08-19. DOI: 10.1109/TKDE.2025.3599265
Haohao Qu;Wenqi Fan;Zihuai Zhao;Qing Li
There is growing interest in utilizing large language models (LLMs) to advance next-generation Recommender Systems (RecSys), driven by their outstanding language understanding and reasoning capabilities. In this scenario, tokenizing users and items becomes essential for ensuring seamless alignment of LLMs with recommendations. While studies have made progress in representing users and items using textual content or latent representations, challenges remain in encoding high-order collaborative knowledge into discrete tokens compatible with LLMs and in generalizing to unseen users/items. To address these challenges, we propose a novel framework called TokenRec, which introduces an effective ID tokenization strategy and an efficient retrieval paradigm for LLM-based recommendations. Our tokenization strategy quantizes the masked user/item representations learned from collaborative filtering into discrete tokens, thus achieving smooth incorporation of high-order collaborative knowledge and generalizable tokenization of users and items for LLM-based RecSys. Meanwhile, our generative retrieval paradigm efficiently recommends top-K items for users, eliminating the need for the time-consuming auto-regressive decoding and beam search processes used by LLMs and thus significantly reducing inference time. Comprehensive experiments validate the effectiveness of the proposed methods, demonstrating that TokenRec outperforms competitive benchmarks, including both traditional recommender systems and emerging LLM-based recommender systems.
{"title":"TokenRec: Learning to Tokenize ID for LLM-Based Generative Recommendations","authors":"Haohao Qu;Wenqi Fan;Zihuai Zhao;Qing Li","doi":"10.1109/TKDE.2025.3599265","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3599265","url":null,"abstract":"There is a growing interest in utilizing large language models (LLMs) to advance next-generation Recommender Systems (RecSys), driven by their outstanding language understanding and reasoning capabilities. In this scenario, tokenizing users and items becomes essential for ensuring seamless alignment of LLMs with recommendations. While studies have made progress in representing users and items using textual contents or latent representations, challenges remain in capturing high-order collaborative knowledge into discrete tokens compatible with LLMs and generalizing to unseen users/items. To address these challenges, we propose a novel framework called <bold>TokenRec</b>, which introduces an effective ID tokenization strategy and an efficient retrieval paradigm for LLM-based recommendations. Our tokenization strategy involves quantizing the masked user/item representations learned from collaborative filtering into discrete tokens, thus achieving smooth incorporation of high-order collaborative knowledge and generalizable tokenization of users and items for LLM-based RecSys. Meanwhile, our generative retrieval paradigm is designed to efficiently recommend top-K items for users, eliminating the need for the time-consuming auto-regressive decoding and beam search processes used by LLMs, thus significantly reducing inference time. 
Comprehensive experiments validate the effectiveness of the proposed methods, demonstrating that TokenRec outperforms competitive benchmarks, including both traditional recommender systems and emerging LLM-based recommender systems.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"6216-6231"},"PeriodicalIF":10.4,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145036983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
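The core step of turning continuous collaborative-filtering representations into discrete tokens can be sketched as nearest-neighbor codebook assignment. This is a generic illustration, not TokenRec's actual tokenizer (which quantizes masked representations); the `quantize` helper and the toy codebook are assumptions.

```python
import numpy as np

def quantize(embeddings, codebook):
    """Map each continuous embedding to the ID of its nearest codebook vector.

    embeddings: (n, d) array of user/item representations.
    codebook:   (k, d) array of learned code vectors.
    Returns (token_ids, quantized_embeddings).
    """
    # Squared Euclidean distance between every embedding and every code vector.
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)       # discrete token ID per embedding
    return tokens, codebook[tokens]     # IDs and their reconstructions
```

The resulting integer IDs can be appended to an LLM's vocabulary, so a user or item is referenced by a short, generalizable token sequence rather than a raw ID string.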
Pub Date: 2025-08-12  DOI: 10.1109/TKDE.2025.3597995
Run-An Wang;Zhaonian Zou;Dandan Liu;Xudong Liu
Mining dense subgraphs on multilayer graphs offers the opportunity for more in-depth discoveries than classical dense subgraph mining on single-layer graphs. However, existing approaches fail to ensure the denseness of a discovered subgraph on the layers of users’ interest while simultaneously gaining partial support for that denseness from other layers. In this paper, we introduce a novel dense subgraph model called FocusCore (FoCore for short) for multilayer graphs, which pays more attention to the layers that users focus on. The FoCore decomposition problem, that is, identifying all nonempty FoCores in a multilayer graph, can be addressed by executing the peeling process with respect to all possible configurations of focus and background layers. Leveraging the properties of FoCores, we devise an interleaved peeling algorithm and a vertex-centric algorithm toward efficient FoCore decomposition. We further design a novel cache that minimizes the average retrieval time for an arbitrary FoCore without requiring a full FoCore decomposition, which significantly improves efficiency in large-scale graph mining tasks. As an application, we propose a FoCore-decomposition-based algorithm to approximate the densest subgraph in a multilayer graph with a provable approximation guarantee. Extensive experiments on real-world datasets verify the effectiveness of the FoCore model and the efficiency of the proposed algorithms.
{"title":"FocusCores of Multilayer Graphs","authors":"Run-An Wang;Zhaonian Zou;Dandan Liu;Xudong Liu","doi":"10.1109/TKDE.2025.3597995","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3597995","url":null,"abstract":"Mining dense subgraphs on multilayer graphs offers the opportunity for more in-depth discoveries than classical dense subgraph mining on single-layer graphs. However, the existing approaches fail to ensure the denseness of a discovered subgraph on layers of users’ interest and simultaneously gain partial supports on the denseness from other layers. In this paper, we introduce a novel dense subgraph model called <underline>Fo</u>cus<underline>Core</u> (FoCore for short) for multilayer graphs, which can pay more attention to the layers focused by users. The FoCore decomposition problem, that is, identifying all nonempty FoCores in a multilayer graph, can be addressed by executing the peeling process with respect to all possible configurations of focus and background layers. Using the nice properties of FoCores, we devise an interleaved peeling algorithm and a vertex-centric algorithm toward efficient FoCore decomposition. We further design a novel cache to minimize the average retrieval time for an arbitrary FoCore without the need for full FoCore decomposition, which significantly improves efficiency in large-scale graph mining tasks. As an application, we propose a FoCore-decomposition-based algorithm to approximate the densest subgraph in a multilayer graph with a provable approximation guarantee. 
The extensive experiments on real-world datasets verify the effectiveness of the FoCore model and the efficiency of the proposed algorithms.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"5890-5904"},"PeriodicalIF":10.4,"publicationDate":"2025-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145050780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
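The peeling process at the heart of FoCore decomposition generalizes classical k-core peeling, which repeatedly deletes vertices of insufficient degree. A single-layer sketch is below; the `k_core` helper is a hypothetical illustration, not the paper's interleaved or vertex-centric algorithm, which additionally distinguishes focus and background layers.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the vertex set of the k-core of an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:                      # peel until no vertex violates the bound
        changed = False
        for u in list(adj):
            if len(adj[u]) < k:         # degree too small: remove u
                for v in adj[u]:
                    adj[v].discard(u)
                del adj[u]
                changed = True
    return set(adj)
```

For example, on a triangle with one pendant vertex, the 2-core keeps the triangle and peels the pendant away.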