IEEE Transactions on Knowledge and Data Engineering: Latest Articles

Do as I Can, Not as I Get: Topology-Aware Multi-Hop Reasoning on Multi-Modal Knowledge Graphs

IF 8.9 Zone 2 Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering

Pub Date : 2025-02-28 DOI: 10.1109/TKDE.2025.3546686

Shangfei Zheng;Hongzhi Yin;Tong Chen;Quoc Viet Hung Nguyen;Wei Chen;Lei Zhao

A multi-modal knowledge graph (MKG) includes triplets that consist of entities and relations and multi-modal auxiliary data. In recent years, multi-hop multi-modal knowledge graph reasoning (MMKGR) based on reinforcement learning (RL) has received extensive attention because it addresses the intrinsic incompleteness of MKG in an interpretable manner. However, its performance is limited by empirically designed rewards and sparse relations. In addition, this method has been designed for the transductive setting where test entities have been seen during training, and it works poorly in the inductive setting where test entities do not appear in the training set. To overcome these issues, we propose TMR (Topology-aware Multi-hop Reasoning), which can conduct MKG reasoning under inductive and transductive settings. Specifically, TMR mainly consists of two components. (1) The topology-aware inductive representation captures information from the directed relations of unseen entities, and aggregates query-related topology features in an attentive manner to generate fine-grained entity-independent features. (2) After completing multi-modal feature fusion, the relation-augmented adaptive RL conducts multi-hop reasoning by eliminating manual rewards and dynamically adding actions. Finally, we construct new MKG datasets with different scales for inductive reasoning evaluation. Experimental results demonstrate that TMR outperforms state-of-the-art MKGR methods under both inductive and transductive settings.

Volume 37, Issue 5, pp. 2405-2419.

Citations: 0

PipeFilter: Parallelizable and Space-Efficient Filter for Approximate Membership Query

IF 8.9 Zone 2 Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering

Pub Date : 2025-02-26 DOI: 10.1109/TKDE.2025.3543881

Shankui Ji;Yang Du;He Huang;Yu-E Sun;Jia Liu;Yapeng Shu

Approximate membership query data structures (i.e., filters) have ubiquitous applications in databases and data mining. Cuckoo filters are emerging as the alternative to Bloom filters because they support deletions and usually have higher operation throughput and space efficiency. However, their designs are confined to a single-threaded execution paradigm and consequently cannot fully exploit the parallel processing capabilities of modern hardware. This paper presents PipeFilter, a faster and more space-efficient filter that harnesses pipeline parallelism for superior performance. PipeFilter re-architects the Cuckoo filter by partitioning its data structure into several sub-filters, each providing a candidate position for every item. This allows the filter operations, including insertion, lookup, and deletion, to be naturally distributed across several pipeline stages, each overseeing one of the sub-filters, which can further be implemented through multi-threaded execution or pipeline stages of programmable hardware to achieve significantly higher throughput. Meanwhile, PipeFilter excels in single-threaded execution thanks to a combination of unique design features, including block design, path prophet, round robin, and SIMD optimization, such that it outperforms the state-of-the-art filters even when running on a single core. PipeFilter also has a competitive advantage in space utilization because it permits each item to explore more candidate positions. We implement and optimize PipeFilter on four platforms (single-core CPU, multi-core CPU, FPGA, and P4 ASIC). Experimental results demonstrate that PipeFilter surpasses all baseline methods on all four platforms. When running on a single core, it shows a 15%$\sim$57% improvement in operation throughput and a high load factor exceeding 99%. With parallel processing on the other platforms, PipeFilter achieves $7\times \sim 800\times$ higher throughput than single-threaded execution.
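The partitioning idea in the abstract (several sub-filters, each offering one candidate position per item) can be illustrated with a small sketch. This is not PipeFilter itself: the class `PartitionedFilter`, its parameters, and the SHA-256-based hashing are assumptions made for illustration, and the sketch omits eviction and relocation as well as the block design, path prophet, round robin, and SIMD features that the abstract names.

```python
import hashlib


class PartitionedFilter:
    """Toy fingerprint filter split into sub-filters (hypothetical, for illustration)."""

    def __init__(self, num_subfilters=4, slots_per_subfilter=1024):
        self.k = num_subfilters
        self.m = slots_per_subfilter
        # One fingerprint table per sub-filter; None marks an empty slot.
        self.tables = [[None] * self.m for _ in range(self.k)]

    def _fingerprint(self, item):
        digest = hashlib.sha256(("fp:" + item).encode()).digest()
        return int.from_bytes(digest[:2], "big")

    def _slot(self, item, stage):
        # Each sub-filter (stage) gives the item exactly one candidate slot.
        digest = hashlib.sha256((f"{stage}:" + item).encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        fp = self._fingerprint(item)
        for stage in range(self.k):  # stages are visited one after another
            slot = self._slot(item, stage)
            if self.tables[stage][slot] is None:
                self.tables[stage][slot] = fp
                return True
        # All candidates are occupied; a cuckoo-style filter would evict
        # and relocate an existing fingerprint here, which this sketch skips.
        return False

    def contains(self, item):
        fp = self._fingerprint(item)
        return any(
            self.tables[stage][self._slot(item, stage)] == fp
            for stage in range(self.k)
        )

    def delete(self, item):
        fp = self._fingerprint(item)
        for stage in range(self.k):
            slot = self._slot(item, stage)
            if self.tables[stage][slot] == fp:
                self.tables[stage][slot] = None
                return True
        return False


f = PartitionedFilter()
f.insert("alice")
print(f.contains("alice"))
f.delete("alice")
print(f.contains("alice"))
```

Because every operation touches the sub-filters in a fixed order and each stage only reads or writes its own table, the loop over `stage` is the part that could plausibly be mapped onto threads or hardware pipeline stages.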

Volume 37, Issue 5, pp. 2816-2830.

Citations: 0

A Universal Pre-Training and Prompting Framework for General Urban Spatio-Temporal Prediction

IF 8.9 Zone 2 Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering

Pub Date : 2025-02-26 DOI: 10.1109/TKDE.2025.3545948

Yuan Yuan;Jingtao Ding;Jie Feng;Depeng Jin;Yong Li

Urban spatio-temporal prediction is crucial for informed decision-making, such as traffic management, resource optimization, and emergency response. Despite remarkable breakthroughs in pretrained natural language models that enable one model to handle diverse tasks, a universal solution for spatio-temporal prediction remains challenging. Existing prediction approaches are typically tailored for specific spatio-temporal scenarios, requiring task-specific model designs and extensive domain-specific training data. In this study, we introduce UniST, a universal model designed for general urban spatio-temporal prediction across a wide range of scenarios. Inspired by large language models, UniST achieves success through: (i) utilizing diverse spatio-temporal data from different scenarios, (ii) effective pre-training to capture complex spatio-temporal dynamics, (iii) knowledge-guided prompts to enhance generalization capabilities. These designs together unlock the potential of building a universal model for various scenarios. Extensive experiments on more than 20 spatio-temporal scenarios, including grid-based data and graph-based data, demonstrate UniST’s efficacy in advancing state-of-the-art performance, especially in few-shot and zero-shot prediction.

Volume 37, Issue 5, pp. 2212-2225.

Citations: 0

Improving Sequential Recommendations via Bidirectional Temporal Data Augmentation With Pre-Training

IF 8.9 Zone 2 Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering

Pub Date : 2025-02-26 DOI: 10.1109/TKDE.2025.3546035

Juyong Jiang;Peiyan Zhang;Yingtao Luo;Chaozhuo Li;Jae Boum Kim;Kai Zhang;Senzhang Wang;Sunghun Kim;Philip S. Yu

Sequential recommendation systems are integral to discerning temporal user preferences. Yet, the task of learning from abbreviated user interaction sequences poses a notable challenge. Data augmentation has been identified as a potent strategy to enhance the informational richness of these sequences. Traditional augmentation techniques, such as item randomization, may disrupt the inherent temporal dynamics. Although recent advancements in reverse chronological pseudo-item generation have shown promise, they can introduce temporal discrepancies when assessed in a natural chronological context. In response, we introduce a sophisticated approach, Bidirectional temporal data Augmentation with pre-training (BARec). Our approach leverages bidirectional temporal augmentation and knowledge-enhanced fine-tuning to synthesize authentic pseudo-prior items that retain user preferences and capture deeper item semantic correlations, thus boosting the model’s expressive power. Our comprehensive experimental analysis on five benchmark datasets confirms the superiority of BARec across both short and elongated sequence contexts. Moreover, theoretical examination and case study offer further insight into the model’s logical processes and interpretability.

Volume 37, Issue 5, pp. 2652-2664.

Citations: 0

Estimating Multi-Label Expected Accuracy Using Labelset Distributions

IF 8.9 Zone 2 Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering

Pub Date : 2025-02-26 DOI: 10.1109/TKDE.2025.3545972

Laurence A. F. Park;Jesse Read

A multi-label classifier estimates the binary label state (relevant/irrelevant) for each of a set of concept labels, for a given instance. Probabilistic multi-label classifiers provide a distribution over all possible labelset combinations of such label states (the powerset of labels), from which we can provide the best estimate by selecting the labelset corresponding to the largest expected accuracy. Providing confidence for predictions is important for real-world application of multi-label models, which provides the practitioner with a sense of the correctness of the prediction. It has been thought that the probability of the chosen labelset is a good measure of the confidence of the prediction, but multi-label accuracy can be measured in many ways and so confidence should align with the expected accuracy of the evaluation method. In this article, we investigate the effectiveness of seven candidate functions for estimating multi-label expected accuracy conditioned on the labelset distribution and the evaluation method. We found most correlate to expected accuracy and have varying levels of robustness. Further, we found that the candidate functions provide high expected accuracy estimates for Hamming similarity, but a combination of the candidates provided an accurate estimate of expected accuracy for Jaccard index and Exact match.
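Selecting the labelset with the largest expected accuracy can be made concrete with a short sketch. The labelset distribution `dist` and all function names below are hypothetical; the three similarity functions are the standard definitions of Hamming similarity, Jaccard index, and exact match, and the sketch does not reproduce any of the seven candidate functions studied in the article.

```python
from itertools import product


def hamming(pred, truth):
    # Fraction of label positions on which the two labelsets agree.
    return sum(p == t for p, t in zip(pred, truth)) / len(pred)


def jaccard(pred, truth):
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    # Two empty labelsets are treated as a perfect match (a common convention).
    return 1.0 if union == 0 else inter / union


def exact_match(pred, truth):
    return 1.0 if pred == truth else 0.0


def expected_accuracy(pred, dist, similarity):
    # Expectation of the similarity over the labelset distribution.
    return sum(prob * similarity(pred, truth) for truth, prob in dist.items())


def best_labelset(dist, similarity, n_labels):
    # Brute force over the powerset of labels; only feasible for few labels.
    candidates = product((0, 1), repeat=n_labels)
    return max(candidates, key=lambda c: expected_accuracy(c, dist, similarity))


# Hypothetical distribution over labelsets for two labels.
dist = {(1, 0): 0.4, (0, 1): 0.3, (1, 1): 0.3}

for name, sim in [("exact", exact_match), ("hamming", hamming), ("jaccard", jaccard)]:
    choice = best_labelset(dist, sim, n_labels=2)
    print(name, choice, round(expected_accuracy(choice, dist, sim), 2))
```

In this toy distribution, exact match selects `(1, 0)`, the most probable labelset, while Hamming similarity selects `(1, 1)`, whose probability is only 0.3 although its expected Hamming accuracy is 0.65. This gap is the reason the abstract argues that confidence should follow the evaluation method rather than the labelset probability.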

Volume 37, Issue 5, pp. 2513-2524.

Citations: 0

Doing More With Less: A Survey of Data Selection Methods for Mathematical Modeling

IF 8.9 Zone 2 Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering

Pub Date : 2025-02-26 DOI: 10.1109/TKDE.2025.3545965

Nicolai A. Weinreich;Arman Oshnoei;Remus Teodorescu;Kim G. Larsen

Big data applications such as Artificial Intelligence (AI) and Internet of Things (IoT) have in recent years been leading to many technological breakthroughs in system modeling. However, these applications are typically data intensive, thus requiring an increasing cost of resources. In this paper, a first-of-its-kind comprehensive review of data selection methods across different engineering disciplines is given in order to analyze the effectiveness of these methods in improving the data efficiency of mathematical modeling algorithms. Eight distinct selection methods have been identified and subsequently analyzed and discussed on the basis of the relevant literature. In addition, the selection methods have been classified according to three dichotomies established by the survey. A comparative analysis of these methods was conducted along with a discussion of potentials, challenges, and future research directions for the research area. Data selection was found to be widely used in many engineering applications and has the potential to play an important role in making more sustainable Big Data applications, especially those in which transmission of data across large distances is required. Furthermore, making resource-aware decisions about the use of data has been shown to be highly effective in reducing energy costs while ensuring high performance of the model.

Volume 37, Issue 5, pp. 2420-2439. Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10904270

Citations: 0

Efficient and Accurate Spatial Queries Using Lossy Compressed 3D Geometry Data

IF 8.9 Zone 2 Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering

Pub Date : 2025-02-26 DOI: 10.1109/TKDE.2025.3539729

Dejun Teng;Zhaochuan Li;Zhaohui Peng;Shuai Ma;Fusheng Wang

3D spatial data management is increasingly vital across various application scenarios, such as GIS, digital twins, human atlases, and tissue imaging. However, the inherent complexity of 3D spatial data, primarily represented by 3D geometries in real-world applications, hinders the efficient evaluation of spatial relationships through resource-intensive geometric computations. Geometric simplification algorithms have been developed to reduce the complexity of 3D representations, albeit at the cost of querying accuracy. Previous work has aimed to address precision loss by leveraging the spatial relationship between the simplified and original 3D object representations. However, this approach relied on specialized geometric simplification algorithms tailored to regions with specific criteria. In this paper, we introduce a novel approach to achieve highly efficient and accurate 3D spatial queries, incorporating geometric computation and simplification. We present a generalized progressive refinement methodology applicable to general geometric simplification algorithms, involving accurate querying of 3D geometry data using low-resolution representations and simplification extents quantified using Hausdorff distances at the facet level. Additionally, we propose techniques for calculating and storing Hausdorff distances efficiently. Extensive experimental evaluations validate the effectiveness of the proposed method which outperforms state-of-the-art systems by a factor of 4 while minimizing computational and storage overhead.
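To show how a Hausdorff distance can quantify the extent of a simplification, here is a minimal sketch. It works on vertex sets rather than on facets, so it is a simplification of what the abstract describes. The function names and the toy coordinates are hypothetical, and `distance_bounds` is a generic triangle-inequality argument, not necessarily the refinement rule used in the paper.

```python
import math


def directed_hausdorff(a, b):
    # Largest distance from a point of `a` to its nearest point in `b`.
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    # Symmetric Hausdorff distance between two finite point sets.
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def distance_bounds(d_simplified, h_a, h_b):
    # If each original object lies within Hausdorff distance h of its
    # simplified version, the true minimum distance between the originals
    # lies in this interval (by the triangle inequality).
    lower = max(0.0, d_simplified - h_a - h_b)
    upper = d_simplified + h_a + h_b
    return lower, upper


# Hypothetical object: dropping the apex (0, 0, 1) simplifies it to a triangle.
original = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
simplified = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]

h = hausdorff(original, simplified)
print(h)
print(distance_bounds(5.0, h, 0.5))
```

With bounds of this kind, a distance query evaluated on low-resolution representations can be answered early whenever the threshold falls outside the interval, and only the remaining cases need a higher-resolution pass.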

Volume 37, Issue 5, pp. 2472-2487.

Citations: 0

A Novel Expandable Borderline Smote Over-Sampling Method for Class Imbalance Problem

IF 8.9 Zone 2 Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering

Pub Date : 2025-02-25 DOI: 10.1109/TKDE.2025.3544284

Hao Sun;Jianping Li;Xiaoqian Zhu

The class imbalance problem can cause classifiers to be biased toward the majority class and inclined to generate incorrect predictions. While existing studies have proposed numerous oversampling methods to alleviate class imbalance by generating extra minority class samples, these methods still have some inherent weaknesses and make the generated samples less informative. This study proposes a novel over-sampling method named the Expandable Borderline Smote (EB-Smote), which can address the weaknesses of existing over-sampling methods and generate more informative synthetic samples. In EB-Smote, not only minority class but also majority class is oversampled, and the synthetic samples are generated in the area between the selected minority and majority samples, which are close to the borderlines of their respective classes. EB-Smote can generate more informative samples by expanding the borderlines of minority and majority classes toward the actual decision boundary. Based on 27 imbalanced datasets and commonly used machine learning models, the experimental results demonstrate that EB-Smote significantly outperforms the other 8 existing oversampling methods. This study can provide theoretical guidance and practical recommendations to solve the crucial class imbalance problem in classification tasks.

Volume 37, Issue 5, pp. 2183-2199.

Citations: 0

An Amortized O(1) Lower Bound for Dynamic Time Warping in Motif Discovery

IF 8.9 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering

Pub Date : 2025-02-24 DOI: 10.1109/TKDE.2025.3544751

Zemin Chao;Hong Gao;Dongjing Miao;Jianzhong Li;Hongzhi Wang

Motif discovery is a critical operation for analyzing series data in many applications. Recent work demonstrates the importance of finding motifs under Dynamic Time Warping. However, existing algorithms spend most of their time computing lower bounds of Dynamic Time Warping to filter out unpromising candidates. Specifically, computing these lower bounds takes $O(L)$ time for each pair of subsequences, where $L$ is the length of the motif (subsequences). This paper proposes two new lower bounds, $LB_{f}$ and $LB_{M}$, each of which costs only amortized $O(1)$ time per pair of subsequences. On real datasets, the proposed lower bounds are at least one order of magnitude faster than the state-of-the-art lower bounds used in motif discovery while remaining satisfactorily effective. Building on these faster lower bounds, the paper designs an efficient motif discovery algorithm that significantly reduces the cost of computing lower bounds. Experiments on real datasets show that the proposed algorithm is on average 5.6 times faster than state-of-the-art algorithms.
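To make the role of a lower bound concrete, the sketch below uses the well-known LB_Keogh bound to skip full Dynamic Time Warping computations during a nearest-candidate search. LB_Keogh is chosen here only as a familiar example of a bound that costs time linear in $L$ once the envelope is available. The abstract does not say which existing bounds it refers to, and the proposed $LB_{f}$ and $LB_{M}$ are not reproduced here. The band width `w`, the squared-difference cost, and all function names are assumptions of this example.

```python
def dtw_banded(q, c, w):
    """Squared-cost DTW between equal-length series q and c, with the
    warping path restricted to |i - j| <= w (Sakoe-Chiba band)."""
    n = len(q)
    inf = float("inf")
    d = [[inf] * (n + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][n]

def lb_keogh(q, c, w):
    """LB_Keogh lower bound on dtw_banded(q, c, w).

    The envelope of q is recomputed per index for brevity, which costs
    O(L * w); precomputing the envelope once makes the bound O(L) per pair.
    """
    n = len(q)
    total = 0.0
    for i in range(n):
        window = q[max(0, i - w): min(n, i + w + 1)]
        upper, lower = max(window), min(window)
        if c[i] > upper:
            total += (c[i] - upper) ** 2
        elif c[i] < lower:
            total += (lower - c[i]) ** 2
    return total

def closest(q, candidates, w):
    """Return (index, distance) of the candidate with the smallest banded
    DTW distance to q, skipping candidates the lower bound rules out."""
    best_idx, best_d = None, float("inf")
    for idx, c in enumerate(candidates):
        if lb_keogh(q, c, w) >= best_d:
            continue  # the true distance cannot beat the best found so far
        dist = dtw_banded(q, c, w)
        if dist < best_d:
            best_idx, best_d = idx, dist
    return best_idx, best_d
```

Because the bound never exceeds the true banded distance, pruning with it cannot change the result of the search; it only reduces how often the quadratic-time DTW routine runs.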

IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 5, pp. 2239-2252. Publication date: 2025-02-24. DOI: 10.1109/TKDE.2025.3544751 (https://doi.org/10.1109/TKDE.2025.3544751)

Citations: 0

Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns

IF 8.9 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering

Pub Date : 2025-02-24 DOI: 10.1109/TKDE.2025.3545176

Yuxiang Guo;Yuren Mao;Zhonghao Hu;Lu Chen;Yunjun Gao

Semantic join discovery, which aims to find columns in a table repository with high semantic joinability to a query column, is crucial for dataset discovery. Existing methods fall into two categories, cell-level methods and column-level methods, but neither achieves effectiveness and efficiency at the same time. Cell-level methods, which compute joinability by counting cell matches between columns, are highly effective but inefficient. In contrast, column-level methods, which determine joinability only by computing the similarity of column embeddings, are efficient but less effective because of three issues in their column embeddings: (i) the semantics-joinability gap, (ii) the size limit, and (iii) permutation sensitivity. To address these issues, this paper proposes computing column embeddings via proxy columns and presents Snoopy, a novel column-level semantic join discovery framework that leverages proxy-column-based embeddings to bridge effectiveness and efficiency. Specifically, the proposed column embeddings are derived from implicit column-to-proxy-column relationships, which are captured by a lightweight column projection based on approximate graph matching. To acquire good proxy columns for guiding the column projection, the authors introduce a rank-aware contrastive learning paradigm. Extensive experiments on four real-world datasets demonstrate that Snoopy outperforms SOTA column-level methods by 16% in Recall@25 and 10% in NDCG@25, and achieves superior efficiency: it is at least 5 orders of magnitude faster than cell-level solutions and 3.5× faster than existing column-level methods.
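The contrast between the two families of methods can be sketched with toy data. The cell-level score below counts query cells that have at least one sufficiently similar candidate cell, which requires comparing every pair of cells. The column-level score compares a single vector per column. The mean pooling used for the column vector is only a stand-in and is not Snoopy's proxy-column embedding. The threshold `tau`, the toy two-dimensional cell embeddings, and all function names are assumptions of this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cell_level_joinability(query_cells, cand_cells, tau):
    """Fraction of query cells with at least one candidate cell whose
    cosine similarity is >= tau. Costs O(|Q| * |C|) comparisons."""
    matched = sum(
        1 for qv in query_cells
        if any(cosine(qv, cv) >= tau for cv in cand_cells)
    )
    return matched / len(query_cells)

def mean_pool(cells):
    """Average the cell vectors into one column vector (stand-in embedding)."""
    dim = len(cells[0])
    return [sum(v[k] for v in cells) / len(cells) for k in range(dim)]

def column_level_score(query_cells, cand_cells):
    """One similarity computation per column pair, after pooling."""
    return cosine(mean_pool(query_cells), mean_pool(cand_cells))
```

In a repository setting the pooled vectors would be computed once per column and indexed, so a query needs one vector comparison per candidate column instead of one per pair of cells. That is the efficiency advantage the abstract attributes to column-level methods.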

IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 5, pp. 2971-2985. Publication date: 2025-02-24. DOI: 10.1109/TKDE.2025.3545176 (https://doi.org/10.1109/TKDE.2025.3545176)

Citations: 0