Pub Date : 2025-09-17 DOI: 10.1109/TKDE.2025.3610932
Youwen Zhu;Shibo Dai;Pengfei Zhang;Xiqi Kuang
Input-discriminative local differential privacy (ID-LDP) applies different levels of protection to inputs with different values, which improves the utility of the estimated data compared to traditional LDP. However, existing ID-LDP methods are designed for categorical data and cannot be directly applied to numerical data. In this paper, we propose a numerical data collection (NDC) framework with ID-LDP to provide discriminative protection for data with different inputs. The framework uses a piecewise mechanism to divide the numerical domain into several segments and designs two perturbation methods that estimate the mean of the numerical data submitted by users with minimal error. We first create an NDC-UE method that encodes the raw data into a binary vector, setting the bit corresponding to the uploaded value to 1 and the rest to 0, then perturbing each bit with a given probability. We further propose an NDC-GRR algorithm that perturbs the numerical data with an optimal privacy budget. To reduce the complexity of NDC-GRR, we apply a greedy-algorithm-based spanner to shorten the computation time and improve the accuracy. Theoretical analysis proves that our schemes satisfy the definition of ID-LDP. Experimental results on two real-world datasets and a synthetic dataset show that the proposed schemes achieve lower mean squared error than the benchmarks.
{"title":"Numerical Data Collection Under Input-Discriminative Local Differential Privacy","authors":"Youwen Zhu;Shibo Dai;Pengfei Zhang;Xiqi Kuang","doi":"10.1109/TKDE.2025.3610932","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3610932","url":null,"abstract":"Input-discriminative local differential privacy (ID-LDP) protects user data with a different range of values, which improves the utility of the estimated data compared to traditional LDP. However, the existing ID-LDP methods are used for categorical data and cannot be directly applied to numerical data. In this paper, we propose a numerical data collection (NDC) framework with ID-LDP to provide discriminative protection for the data with different inputs. This framework uses a piecewise mechanism to divide the numerical data into several segments and designs two perturbation methods to minimize the mean value of numerical data based on values submitted by users. We first create an NDC-UE method that encodes the raw data into a binary vector. This method sets the uploaded data bit as 1 and the rest as zero and perturbs each bit with a given probability. We further propose an NDC-GRR algorithm to perturb the numerical data with an optimal privacy budget. To reduce the complexity of NDC-GRR, we apply a greedy algorithm-based spanner to shorten the computation time and improve the accuracy. Theoretical analysis proves that our schemes satisfy the definition of ID-LDP. Experimental results based on two real-world datasets and a synthetic dataset show that the proposed schemes have less mean square error compared with the benchmarks.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 12","pages":"7346-7361"},"PeriodicalIF":10.4,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-16 DOI: 10.1109/TKDE.2025.3609825
Preeja Pradeep;Marta Caro-Martínez;Anjana Wijekoon
Artificial intelligence (AI) advancements have significantly broadened its application across various sectors, simultaneously elevating concerns regarding the transparency and understandability of AI-driven decisions. Addressing these concerns, this paper explores Case-Based Reasoning (CBR) and Explainable Artificial Intelligence (XAI), critically examining their convergence and the potential this synergy holds for demystifying the decision-making processes of AI systems. We employ the concept of an Explainable CBR (XCBR) system, which leverages CBR to acquire case-based explanations or generates explanations using CBR methodologies to enhance the explainability of AI decisions. Though the literature contains few surveys on XCBR, recognizing its potential necessitates a detailed exploration of the principles for developing effective XCBR systems. We present a cycle-aligned perspective that examines how explainability functions can be embedded throughout the classical CBR phases: Retrieve, Reuse, Revise, and Retain. Drawing from a comprehensive literature review, we propose a set of six functional goals that reflect key explainability needs. These goals are mapped to six thematic categories, forming the basis of a structured XCBR taxonomy. The discussion extends to the broader challenges and prospects facing the CBR-XAI arena, setting the stage for future research directions. This paper offers design guidance and conceptual grounding for future XCBR research and system development.
{"title":"Empowering Explainable Artificial Intelligence Through Case-Based Reasoning: A Comprehensive Exploration","authors":"Preeja Pradeep;Marta Caro-Martínez;Anjana Wijekoon","doi":"10.1109/TKDE.2025.3609825","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3609825","url":null,"abstract":"Artificial intelligence (AI) advancements have significantly broadened its application across various sectors, simultaneously elevating concerns regarding the transparency and understandability of AI-driven decisions. Addressing these concerns, this paper embarks on an exploratory journey into Case-Based Reasoning (CBR) and Explainable Artificial Intelligence (XAI), critically examining their convergence and the potential this synergy holds for demystifying the decision-making processes of AI systems. We employ the concept of Explainable CBR (XCBR) system that leverages CBR to acquire case-based explanations or generate explanations using CBR methodologies to enhance AI decision explainability. Though the literature has few surveys on XCBR, recognizing its potential necessitates a detailed exploration of the principles for developing effective XCBR systems. We present a cycle-aligned perspective that examines how explainability functions can be embedded throughout the classical CBR phases: Retrieve, Reuse, Revise, and Retain. Drawing from a comprehensive literature review, we propose a set of six functional goals that reflect key explainability needs. These goals are mapped to six thematic categories, forming the basis of a structured XCBR taxonomy. The discussion extends to the broader challenges and prospects facing the CBR-XAI arena, setting the stage for future research directions. This paper offers design guidance and conceptual grounding for future XCBR research and system development.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 12","pages":"7120-7139"},"PeriodicalIF":10.4,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11165042","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-12 DOI: 10.1109/TKDE.2025.3609415
Qi Xiong;Kai Tang;Minbo Ma;Ji Zhang;Jie Xu;Tianrui Li
Long-term time series forecasting (LTSF) is a critical task across diverse domains. Despite significant advancements in LTSF research, we identify a performance bottleneck in existing LTSF methods caused by the inadequate modeling of Temporal Dependencies within the Target (TDT). To address this issue, we propose a novel and generic temporal modeling framework, Temporal Dependency Alignment (TDAlign), that equips existing LTSF methods with TDT learning capabilities. TDAlign introduces two key innovations: 1) a loss function that aligns the change values between adjacent time steps in the predictions with those in the target, ensuring consistency with variation patterns, and 2) an adaptive loss balancing strategy that seamlessly integrates the new loss function with existing LTSF methods without introducing additional learnable parameters. As a plug-and-play framework, TDAlign enhances existing methods with minimal computational overhead, featuring only linear time complexity and constant space complexity relative to the prediction length. Extensive experiments on six strong LTSF baselines across seven real-world datasets demonstrate the effectiveness and flexibility of TDAlign. On average, TDAlign reduces baseline prediction errors by 1.47% to 9.19% and change value errors by 4.57% to 15.78%, highlighting its substantial performance improvements.
{"title":"Modeling Temporal Dependencies Within the Target for Long-Term Time Series Forecasting","authors":"Qi Xiong;Kai Tang;Minbo Ma;Ji Zhang;Jie Xu;Tianrui Li","doi":"10.1109/TKDE.2025.3609415","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3609415","url":null,"abstract":"Long-term time series forecasting (LTSF) is a critical task across diverse domains. Despite significant advancements in LTSF research, we identify a performance bottleneck in existing LTSF methods caused by the inadequate modeling of Temporal Dependencies within the Target (TDT). To address this issue, we propose a novel and generic temporal modeling framework, Temporal Dependency Alignment (TDAlign), that equips existing LTSF methods with TDT learning capabilities. TDAlign introduces two key innovations: 1) a loss function that aligns the change values between adjacent time steps in the predictions with those in the target, ensuring consistency with variation patterns, and 2) an adaptive loss balancing strategy that seamlessly integrates the new loss function with existing LTSF methods without introducing additional learnable parameters. As a plug-and-play framework, TDAlign enhances existing methods with minimal computational overhead, featuring only linear time complexity and constant space complexity relative to the prediction length. Extensive experiments on six strong LTSF baselines across seven real-world datasets demonstrate the effectiveness and flexibility of TDAlign. On average, TDAlign reduces baseline prediction errors by <bold>1.47%</b> to <bold>9.19%</b> and change value errors by <bold>4.57%</b> to <bold>15.78%</b>, highlighting its substantial performance improvements.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 12","pages":"7300-7314"},"PeriodicalIF":10.4,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-12 DOI: 10.1109/TKDE.2025.3608721
Xu Zhang;Yuheng Jia;Mofei Song;Ran Wang
Ensemble clustering aggregates multiple weak clusterings to achieve a more accurate and robust consensus result. Co-Association (CA) matrix based methods are the mainstream ensemble clustering approach: they construct similarity relationships between sample pairs according to the weak clustering partitions and use them to generate the final clustering result. However, existing methods neglect that the quality of a cluster is related to its size, i.e., a smaller cluster tends to be more accurate. Moreover, they do not consider the valuable dissimilarity information in the base clusterings, which reflects the varying importance of sample pairs that are completely disconnected. To this end, we propose the Similarity and Dissimilarity Guided Co-association matrix (SDGCA) for ensemble clustering. First, we introduce normalized ensemble entropy to estimate the quality of each cluster and construct a similarity matrix based on this estimate. Then, we employ a random walk to explore the high-order proximity of the base clusterings and construct a dissimilarity matrix. Finally, the adversarial relationship between the similarity matrix and the dissimilarity matrix is utilized to construct a promoted CA matrix for ensemble clustering. We compared our method with 13 state-of-the-art methods across 12 datasets, and the results demonstrate the superior clustering ability and robustness of the proposed approach.
{"title":"Similarity and Dissimilarity Guided Co-Association Matrix Construction for Ensemble Clustering","authors":"Xu Zhang;Yuheng Jia;Mofei Song;Ran Wang","doi":"10.1109/TKDE.2025.3608721","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3608721","url":null,"abstract":"Ensemble clustering aggregates multiple weak clusterings to achieve a more accurate and robust consensus result. The Co-Association matrix (CA matrix) based method is the mainstream ensemble clustering approach that constructs the similarity relationships between sample pairs according the weak clustering partitions to generate the final clustering result. However, the existing methods neglect that the quality of cluster is related to its size, i.e., a cluster with smaller size tends to higher accuracy. Moreover, they also do not consider the valuable dissimilarity information in the base clusterings which can reflect the varying importance of sample pairs that are completely disconnected. To this end, we propose the Similarity and Dissimilarity Guided Co-association matrix (SDGCA) to achieve ensemble clustering. First, we introduce normalized ensemble entropy to estimate the quality of each cluster, and construct a similarity matrix based on this estimation. Then, we employ the random walk to explore high-order proximity of base clusterings to construct a dissimilarity matrix. Finally, the adversarial relationship between the similarity matrix and the dissimilarity matrix is utilized to construct a promoted CA matrix for ensemble clustering. We compared our method with 13 state-of-the-art methods across 12 datasets, and the results demonstrated the superior clustering ability and robustness of the proposed approach.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 11","pages":"6694-6707"},"PeriodicalIF":10.4,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145242590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-12 DOI: 10.1109/TKDE.2025.3609486
Zijin Hong;Zheng Yuan;Qinggang Zhang;Hao Chen;Junnan Dong;Feiran Huang;Xiao Huang
Generating accurate SQL from users’ natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with a limited parameter size often produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods, which in turn restricts the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based solutions can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and the evolution of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we summarize the remaining challenges in this field and suggest directions for future research.
{"title":"Next-Generation Database Interfaces: A Survey of LLM-Based Text-to-SQL","authors":"Zijin Hong;Zheng Yuan;Qinggang Zhang;Hao Chen;Junnan Dong;Feiran Huang;Xiao Huang","doi":"10.1109/TKDE.2025.3609486","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3609486","url":null,"abstract":"Generating accurate SQL from users’ natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with a limited parameter size often produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods, which restrict the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based solutions can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and evolutionary process of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we make a summary and discuss the remaining challenges in this field and suggest expectations for future research directions.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 12","pages":"7328-7345"},"PeriodicalIF":10.4,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-11 DOI: 10.1109/TKDE.2025.3609045
Jingyi Xie;Jiawei Liu;Zheng-jun Zha
The rapid proliferation of multimedia fake news on social media has raised significant concerns in recent years. Existing studies on fake news detection predominantly adopt an instance-based paradigm, where the detector evaluates a single post to determine its veracity. Despite notable advancements achieved in this domain, we argue that the instance-based approach is misaligned with real-world deployment scenarios. In practice, detectors typically operate on servers that process incoming posts in temporal order, striving to assess their authenticity promptly. Instance-based detectors lack awareness of temporal information and contextual relationships between surrounding posts, and therefore fail to capture long-range dependencies from the timeline. To bridge this gap, we introduce a more practical stream-based multi-modal fake news detection paradigm, which assumes that social media posts arrive continuously over time and allows the utilization of previously seen posts to aid in the classification of incoming ones. To enable effective and transferable fake news detection under this novel paradigm, we propose maintaining historical knowledge as a collection of incremental high-level forgery patterns. Based on this principle, we design a novel framework called Incremental Forgery Pattern Learning and Clues Refinement (IPLCR). IPLCR incrementally learns high-level forgery patterns as the stream evolves, leveraging this knowledge to improve the detection of newly arrived posts. At the core of IPLCR is the Incremental Forgery Pattern Bank (IPB), which dynamically summarizes historical posts into a set of latent forgery patterns. IPB is designed to continuously incorporate timely knowledge and actively discard obsolete information, even during inference. When a new post arrives, IPLCR retrieves the most relevant forgery pattern knowledge from IPB and refines the clues for fake news detection. The refined clues are subsequently incorporated into IPB to enrich its knowledge base. Extensive experiments validate IPLCR’s effectiveness as a robust stream-based detector. Moreover, IPLCR addresses several critical issues relevant to industrial applications, including seamless context transfer and efficient model upgrading, making it a practical solution for real-world deployment.
{"title":"Toward Effective and Transferable Detection for Multi-Modal Fake News in the Social Media Stream","authors":"Jingyi Xie;Jiawei Liu;Zheng-jun Zha","doi":"10.1109/TKDE.2025.3609045","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3609045","url":null,"abstract":"The rapid proliferation of multimedia fake news on social media has raised significant concerns in recent years. Existing studies on fake news detection predominantly adopt an instance-based paradigm, where the detector evaluates a single post to determine its veracity. Despite notable advancements achieved in this domain, we argue that the instance-based approach is misaligned with real-world deployment scenarios. In practice, detectors typically operate on servers that process incoming posts in temporal order, striving to assess their authenticity promptly. Instance-based detectors lack awareness of temporal information and contextual relationships between surrounding posts, therefore fail to capture long-range dependencies from the timeline. To bridge this gap, we introduce a more practical stream-based multi-modal fake news detection paradigm, which assumes that social media posts arrive continuously over time and allows the utilization of previously seen posts to aid in the classification of incoming ones. To enable effective and transferable fake news detection under this novel paradigm, we propose maintaining historical knowledge as a collection of incremental high-level forgery patterns. Based on this principle, we design a novel framework called Incremental Forgery Pattern Learning and Clues Refinement (IPLCR). IPLCR incrementally learns high-level forgery patterns as the stream evolves, leveraging this knowledge to improve the detection of newly arrived posts. At the core of IPLCR is the Incremental Forgery Pattern Bank (IPB), which dynamically summarizes historical posts into a set of latent forgery patterns. IPB is designed to continuously incorporate timely knowledge and actively discard obsolete information, even during inference. When a new post arrives, IPLCR retrieves the most relevant forgery pattern knowledge from IPB and refines the clues for fake news detection. The refined clues are subsequently incorporated into IPB to enrich its knowledge base. Extensive experiments validate IPLCR’s effectiveness as a robust stream-based detector. Moreover, IPLCR addresses several critical issues relevant to industrial applications, including seamless context transfer and efficient model upgrading, making it a practical solution for real-world deployment.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 11","pages":"6723-6737"},"PeriodicalIF":10.4,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145242610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-11 DOI: 10.1109/TKDE.2025.3609302
Ziqiang Yu;Xiaohui Yu;Yueting Chen;Wei Liu;Anbang Song;Bolong Zheng
With the rise of Large Language Models (LLMs), tourists increasingly use them for route planning by entering keywords for attractions, instead of relying on traditional manual map services. LLMs provide generally reasonable suggestions, but often fail to generate optimal plans that account for detailed user requirements, given the vast number of potential POIs and possible routes based on POI combinations within a real-world road network. In this case, a route-planning API could serve as an external tool, accepting a sequence of keywords and returning the top-k best routes tailored to user requests. To address this need, this paper introduces the Keyword-Aware Top-k Routes (KATR) query, which provides more flexible and comprehensive semantics for route planning and caters to various user preferences, including flexible POI visiting order, a flexible travel distance budget, and personalized POI ratings. Subsequently, we propose an explore-and-bound paradigm to efficiently process KATR queries by eliminating redundant candidates based on estimated score bounds from the global to the local level. Extensive experiments demonstrate our approach’s superior performance over existing methods across different scenarios.
{"title":"Flexible Keyword-Aware Top-$k$k Route Search","authors":"Ziqiang Yu;Xiaohui Yu;Yueting Chen;Wei Liu;Anbang Song;Bolong Zheng","doi":"10.1109/TKDE.2025.3609302","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3609302","url":null,"abstract":"With the rise of Large Language Models (LLMs), tourists increasingly use it for route planning by entering keywords for attractions, instead of relying on traditional manual map services. LLMs provide generally reasonable suggestions, but often fail to generate optimal plans that account for detailed user requirements, given the vast number of potential POIs and possible routes based on POI combinations within a real-world road network. In this case, a route-planning API could serve as an external tool, accepting a sequence of keywords and returning the top-<inline-formula><tex-math>$k$</tex-math></inline-formula> best routes tailored to user requests. To address this need, this paper introduces the Keyword-Aware Top-<inline-formula><tex-math>$k$</tex-math></inline-formula> Routes (KATR) query that provides a more flexible and comprehensive semantic to route planning that caters to various user’s preferences including flexible POI visiting order, flexible travel distance budget, and personalized POI ratings. Subsequently, we propose an explore-and-bound paradigm to efficiently process KATR queries by eliminating redundant candidates based on estimated score bounds from global to local levels. Extensive experiments demonstrate our approach’s superior performance over existing methods across different scenarios.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 12","pages":"7184-7198"},"PeriodicalIF":10.4,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-11 DOI: 10.1109/TKDE.2025.3608723
Zidong Wang;Xiaoguang Gao;Qingfu Zhang
Learning graphical causal models from observational data can effectively elucidate the underlying causal mechanisms behind the variables. In the context of limited datasets, modelers often incorporate prior knowledge, which is assumed to be correct, as a penalty in single-objective optimization. However, this approach struggles to accommodate complex and uncertain priors effectively. This paper introduces UpCM, which tackles the issue from a multi-objective optimization perspective. Instead of focusing exclusively on the DAG as the optimization goal, UpCM methodically evaluates the effect of uncertain priors on specific structures, merging data-driven and knowledge-driven objectives. Utilizing the MOEA/D framework, it achieves a balanced trade-off between these objectives. Furthermore, since uncertain priors may introduce erroneous constraints, resulting in PDAGs lacking consistent extensions, the minimal non-consistent extension is explored. This extension, which separately incorporates positive and negative constraints, aims to approximate the true causality of the PDAGs. Experimental results demonstrate that UpCM achieves significant structural accuracy improvements compared to baseline methods: when incorporating uncertain priors, it reduces the SHD by 7.94%, 13.23%, and 12.8% relative to PC_stable, GES, and MAHC, respectively. In downstream inference tasks, UpCM outperforms domain-expert knowledge graphs, owing to its ability to learn explainable causal relationships that balance data-driven evidence with prior knowledge.
{"title":"Uncertain Priors for Graphical Causal Models: A Multi-Objective Optimization Perspective","authors":"Zidong Wang;Xiaoguang Gao;Qingfu Zhang","doi":"10.1109/TKDE.2025.3608723","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3608723","url":null,"abstract":"Learning graphical causal models from observational data can effectively elucidate the underlying causal mechanism behind the variables. In the context of limited datasets, modelers often incorporate prior knowledge, which is assumed to be correct, as a penalty in single-objective optimization. However, this approach struggles to adapt complex and uncertain priors effectively. This paper introduces UpCM, which tackles the issue from a multi-objective optimization perspective. Instead of focusing exclusively on the DAG as the optimization goal, UpCM methodically evaluate the effect of uncertain priors on specific structures, merging data-driven and knowledge-driven objectives. Utilizing the MOEA/D framework, it achieve a balanced trade-off between these objectives. Furthermore, since uncertain priors may introduce erroneous constraints, resulting in PDAGs lacking consistent extensions, the minimal non-consistent extension is explored. This extension, which separately incorporates positive and negative constraints, aims to approximate the true causality of the PDAGs. Experimental results demonstrate that UpCM achieves significant structural accuracy improvements compared to baseline methods. It reduces the SHD by 7.94%, 13.23%, and 12.8% relative to PC_stable, GES, and MAHC, respectively, when incorporating uncertain priors. In downstream inference tasks, UpCM outperforms domain-expert knowledge graphs, owing to its ability to learn explainable causal relationships that balance data-driven evidence with prior knowledge.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 12","pages":"7426-7439"},"PeriodicalIF":10.4,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-09 DOI: 10.1109/TKDE.2025.3607691
Zhuochen Fan;Ruixin Wang;Zihan Jiang;Ruwen Zhang;Tong Yang;Sha Wang;Yuhan Wu;Ruijie Miao;Kaicheng Yang;Bin Cui
Frequent object mining has gained considerable interest in the research community and can be split into frequent item mining and frequent set mining, depending on the type of object. While existing sketch-based algorithms have made significant progress in addressing these two tasks concurrently, they also have notable limitations: they either support only software platforms with low throughput or compromise accuracy for faster processing speed and better hardware compatibility. In this paper, we make a substantial stride toward supporting frequent object mining by designing SandwichSketch, which draws inspiration from sandwich making and proposes two techniques, double fidelity enhancement and hierarchical hot locking, to guarantee high fidelity on both tasks. We implement SandwichSketch on three platforms (CPU, Redis, and FPGA) and show that it improves accuracy by 38.4× and 5× on the two tasks across three real-world datasets, respectively. Additionally, it supports a distributed measurement scenario with less than a 0.01% decrease in Average Relative Error (ARE) when the number of nodes increases from 1 to 16.
{"title":"SandwichSketch: A More Accurate Sketch for Frequent Object Mining in Data Streams","authors":"Zhuochen Fan;Ruixin Wang;Zihan Jiang;Ruwen Zhang;Tong Yang;Sha Wang;Yuhan Wu;Ruijie Miao;Kaicheng Yang;Bui Cui","doi":"10.1109/TKDE.2025.3607691","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3607691","url":null,"abstract":"Frequent object mining has gained considerable interest in the research community and can be split into frequent item mining and frequent set mining depending on the type of object. While existing sketch-based algorithms have made significant progress in addressing these two tasks concurrently, they also possess notable limitations. They either support only software platforms with low throughput or compromise accuracy for faster processing speed and better hardware compatibility. In this paper, we make a substantial stride towards supporting frequent object mining by designing SandwichSketch, which draws inspiration from sandwich making and proposes two techniques including the double fidelity enhancement and hierarchical hot locking to guarantee high fidelity on both two tasks. We implement SandwichSketch on three platforms (CPU, Redis, and FPGA) and show that it enhances accuracy by <inline-formula><tex-math>$38.4times$</tex-math></inline-formula> and <inline-formula><tex-math>$5times$</tex-math></inline-formula> for two tasks on three real-world datasets, respectively. Additionally, it supports a distributed measurement scenario with less than a 0.01% decrease in Average Relative Error (ARE) when the number of nodes increases from 1 to 16.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 11","pages":"6636-6650"},"PeriodicalIF":10.4,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145242587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-09 DOI: 10.1109/TKDE.2025.3608030
Yu Yan;Junfang Huang;Hongzhi Wang;Jian Geng;Kaixin Zhang;Tao Yu
Knob tuning aims to optimize database performance by searching for the most effective knob configuration under a certain workload. Existing works suffer from two significant problems. First, knob tuning performs many useless evaluations, even with diverse search methods, because knobs have different sensitivities under a given workload. Second, a single evaluation of a knob configuration may overestimate or underestimate its quality because of query performance uncertainty. To solve these problems, we propose a query uncertainty-aware knob classifier, called KnobCF, to enhance knob tuning. Our method makes three contributions: (1) we propose uncertainty-aware configuration estimation to improve the tuning process; (2) we design a few-shot uncertainty estimator that requires no extra data collection, ensuring high efficiency in practical tasks; and (3) we provide a flexible framework that can be integrated into existing knob tuners and DBMSs without modification. Our experiments on four open-source benchmarks demonstrate that our method effectively reduces useless evaluations and improves tuning results. In particular, on TPCC our method achieves competitive tuning results with only 60% to 70% of the time consumed by full workload evaluations.
{"title":"KnobCF: Uncertainty-Aware Knob Tuning","authors":"Yu Yan;Junfang Huang;Hongzhi Wang;Jian Geng;Kaixin Zhang;Tao Yu","doi":"10.1109/TKDE.2025.3608030","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3608030","url":null,"abstract":"The knob tuning aims to optimize database performance by searching for the most effective knob configuration under a certain workload. Existing works suffer from two significant problems. First, there exist multiple useless evaluations of knob tuning even with diverse searching methods because of the different sensitivities of knobs on a certain workload. Second, the single evaluation of knob configurations may bring overestimation or underestimation because of query performance uncertainty. To solve the above problems, we propose a query uncertainty-aware knob classifier, called <inline-formula><tex-math>${sf KnobCF}$</tex-math></inline-formula>, to enhance knob tuning. Our method has three contributions: (1) We propose uncertainty-aware configuration estimation to improve the tuning process. (2) We design a few-shot uncertainty estimator that requires no extra data collection, ensuring high efficiency in practical tasks. (3) We provide a flexible framework that can be integrated into existing knob tuners and DBMSs without modification. Our experiments on four open-source benchmarks demonstrate that our method effectively reduces useless evaluations and improves the tuning results. Especially in TPCC, our method achieves competitive tuning results with only 60% to 70% time consumption compared to the full workload evaluations.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 12","pages":"7240-7254"},"PeriodicalIF":10.4,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145456054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}