Large knowledge graphs such as DBpedia, Wikidata, and YAGO are rich sources of structured information, widely used in domains like information retrieval and recommendation. However, their potential for supporting knowledge workers remains underexploited. As in other complex networks where specialized metrics have emerged (e.g., bibliometrics), knowledge graphs offer promising opportunities for domain-specific analysis, particularly in areas lacking established quantitative tools. In this paper, we present Rankingdom, a web application that lets knowledge workers analyze any entity from Wikidata despite its heterogeneity and sheer volume. For this purpose, we introduce complementary indicators to position or evaluate any Wikidata entity within its domain, addressing analysis tasks that are challenging for traditional methods. We propose an on-demand analysis architecture that distributes computation to clients while centralizing results for frugality, relying on a client-side SPARQL parallelization engine (SParaQL). We demonstrate the effectiveness of SParaQL through performance tests on DBpedia, Wikidata, and YAGO with two analytical queries, as well as via a real-world deployment including caching of over 10,000 entities and a user study.
{"title":"Rankingdom: A cooperative architecture for the on-demand analysis of Wikidata","authors":"Hassan Abdallah , Béatrice Markhoff , Manon Ovide , Louise Parkin , Arnaud Soulet","doi":"10.1016/j.datak.2026.102564","DOIUrl":"10.1016/j.datak.2026.102564","url":null,"abstract":"<div><div>Large knowledge graphs such as DBpedia, Wikidata, and YAGO are rich sources of structured information, widely used in domains like information retrieval and recommendation. However, their potential for supporting knowledge workers remains underexploited. As in other complex networks where specialized metrics have emerged (e.g., bibliometrics), knowledge graphs offer promising opportunities for domain-specific analysis — particularly in areas lacking established quantitative tools. In this paper, we present <span><math><mtext>Rankingdom</mtext></math></span>, a web application for knowledge workers to analyze any entity from Wikidata despite its heterogeneity and sheer volume. For this purpose, we introduce complementary indicators to position or evaluate any Wikidata entity within its domain, addressing analysis tasks that are challenging for traditional methods. We propose an on-demand analysis architecture that distributes computation to clients while centralizing results for frugality, by benefiting from a client-side SPARQL parallelization engine (<span>SParaQL</span>). We demonstrate the effectiveness of <span>SParaQL</span> through performance tests on DBpedia, Wikidata, and YAGO with two analytical queries, as well as via a real-world deployment including caching of over 10,000 entities and a user study.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"164 ","pages":"Article 102564"},"PeriodicalIF":2.7,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01. Epub Date: 2026-02-05. DOI: 10.1016/j.datak.2026.102575
Mohammad Hossein Moslemi, Amir Mousavi, Behshid Behkamal, Mostafa Milani
Entity matching (EM) is a fundamental task in data integration and analytics, essential for identifying records that refer to the same real-world entity across diverse sources. In practice, datasets often differ widely in structure, format, schema, and semantics, creating substantial challenges for EM. We refer to this setting as Heterogeneous EM (HEM).
This survey offers a unified perspective on HEM by introducing a taxonomy, grounded in prior work, that distinguishes two primary categories, representation and semantic heterogeneity, and their subtypes. The taxonomy provides a systematic lens for understanding how variations in data form and meaning shape the complexity of matching tasks. We then connect this framework to the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), demonstrating how they both reveal the challenges of HEM and suggest strategies for mitigating them.
Building on this foundation, we critically review recent EM methods, examining their ability to address different heterogeneity types, and conduct targeted experiments on state-of-the-art models to evaluate their robustness and adaptability under semantic heterogeneity. Our analysis uncovers persistent limitations in current approaches and points to promising directions for future research, including multimodal matching, human-in-the-loop workflows, deeper integration with large language models and knowledge graphs, and fairness-aware evaluation in heterogeneous settings.
{"title":"Heterogeneity in entity matching: A survey and experimental analysis","authors":"Mohammad Hossein Moslemi , Amir Mousavi , Behshid Behkamal , Mostafa Milani","doi":"10.1016/j.datak.2026.102575","DOIUrl":"10.1016/j.datak.2026.102575","url":null,"abstract":"<div><div>Entity matching (EM) is a fundamental task in data integration and analytics, essential for identifying records that refer to the same real-world entity across diverse sources. In practice, datasets often differ widely in structure, format, schema, and semantics, creating substantial challenges for EM. We refer to this setting as <em>Heterogeneous EM (HEM)</em>.</div><div>This survey offers a unified perspective on HEM by introducing a taxonomy, grounded in prior work, that distinguishes two primary categories–<em>representation</em> and <em>semantic heterogeneity</em>–and their subtypes. The taxonomy provides a systematic lens for understanding how variations in data form and meaning shape the complexity of matching tasks. We then connect this framework to the <em>FAIR principles</em>–<em>Findability</em>, <em>Accessibility</em>, <em>Interoperability</em>, and <em>Reusability</em>–demonstrating how they both reveal the challenges of HEM and suggest strategies for mitigating them.</div><div>Building on this foundation, we critically review recent EM methods, examining their ability to address different heterogeneity types, and conduct targeted experiments on state-of-the-art models to evaluate their robustness and adaptability under semantic heterogeneity. Our analysis uncovers persistent limitations in current approaches and points to promising directions for future research, including multimodal matching, human-in-the-loop workflows, deeper integration with large language models and knowledge graphs, and fairness-aware evaluation in heterogeneous settings.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"164 ","pages":"Article 102575"},"PeriodicalIF":2.7,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01. Epub Date: 2026-02-10. DOI: 10.1016/j.datak.2026.102577
Md Arif Rahman, Syed Jalaluddin Hashmi, Kethsiya Gnanajothy, Young-Koo Lee
In modern data-intensive applications, optimizing procedural SQL queries in imperative programs is challenging because of inefficient access to intermediate results that are stored in temporary memory. Both traditional and machine learning-based optimization techniques often fail to create indexes on intermediate results because they focus on previously stored base tables and declarative queries. However, indexing intermediate results is a potential solution for further reducing the cost of processing dependent queries. To the best of our knowledge, no existing work has considered reducing the processing cost of dependent queries that rely on intermediate results. Motivated by this issue, we introduce the intermediate index, a temporary index created on intermediate results within the scope of a single program execution. Leveraging the intermediate index, we propose a procedural SQL query optimization technique called AutoCox that identifies and evaluates the benefits of indexes using a novel what-if analysis method. AutoCox dynamically determines the need for indexing based on the producer–consumer relationships, cardinality, and selectivity of intermediate results, accounting for both index creation overhead and runtime reuse. AutoCox ensures that intermediate indexes hold up-to-date data, are created only when beneficial, and are automatically dropped afterward. Experimental results show that our approach significantly outperforms existing methods, achieving a 67% reduction in the cost of dependent queries and 261× speedups of the imperative program on a workload. This indicates that our approach effectively bridges a critical gap in the optimization of procedural SQL queries in relational big data processing environments.
{"title":"When temporary results meet intermediate index: An optimization technique of procedural SQL query processing","authors":"Md Arif Rahman , Syed Jalaluddin Hashmi , Kethsiya Gnanajothy , Young-Koo Lee","doi":"10.1016/j.datak.2026.102577","DOIUrl":"10.1016/j.datak.2026.102577","url":null,"abstract":"<div><div>In modern data-intensive applications, optimizing procedural SQL queries in imperative programs is challenging because of inefficient access to intermediate results that are stored in temporary memory. Both the traditional and machine learning-based optimization techniques often fail to create indexes on intermediate results because they focus on previously stored base tables and declarative queries. However, indexing on intermediate results can be a potential solution to further reducing the cost of processing dependent queries. To the best of our knowledge, no existing work has considered reducing the processing cost of dependent queries that rely on intermediate results. Inspired by this issue, we introduce the intermediate index, a temporary index created on intermediate results within the scope of a single program execution. Leveraging the intermediate index, we propose a procedural SQL query optimization technique called <em>AutoCox</em> that identifies and evaluates the benefits of indexes using a novel <em>what-if</em> analysis method. <em>AutoCox</em> dynamically determines the need for indexing based on the producer–consumer relationships, cardinality, and selectivity of intermediate results, accounting for both index creation overhead and runtime reuse. <em>AutoCox</em> ensures that intermediate indexes hold up-to-date data, which are created only when beneficial and automatically dropped afterward. Experimental results show that our approach significantly outperforms existing methods, achieving a magnitude of 67% cost reduction of dependent queries and 261<span><math><mo>×</mo></math></span> speedups of the imperative program using a workload. It indicates that we effectively bridge a critical gap in the optimization of procedural SQL queries in relational big data processing environments.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"164 ","pages":"Article 102577"},"PeriodicalIF":2.7,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-05-01. Epub Date: 2026-01-08. DOI: 10.1016/j.datak.2026.102556
Mingjun Xin, Ze He, Zhijun Xiao
The sequential recommendation task is an important research direction in recommendation systems. Previous sequential recommendation research mainly focuses on the user–item interaction sequence and mines collaborative information from it. Although these studies have achieved certain results, they tend to pay less attention to other rich information, such as item descriptions, item labels, and user reviews. In fact, this rich information can aid in learning the embedding representation of items and in modeling user preferences. To tackle this issue, we propose a BERT model and momentum contrastive learning based sequential recommendation method named BertMoSRec. The BERT block uses the BERT model to learn the embedding representation of items in combination with item descriptions, item labels, and user reviews, and then uses an embedding smoothing task to obtain an isotropic semantic representation. In the momentum contrastive learning block, we use a variety of data augmentation methods and maintain a large negative sample queue, which is used for contrastive learning over user–item interaction sequences: it learns the embedding representation of the sequences, captures user preference information, and reduces the requirements for computing resources. Extensive experiments on multiple subsets of the Amazon dataset demonstrate the effectiveness of our proposed method.
{"title":"A BERT model and momentum contrastive learning based sequential recommendation method and its implementation","authors":"Mingjun Xin, Ze He, Zhijun Xiao","doi":"10.1016/j.datak.2026.102556","DOIUrl":"10.1016/j.datak.2026.102556","url":null,"abstract":"<div><div>The sequential recommendation task is an important research direction in the recommendation system. Previous sequential recommendation researches mainly focus on the user–item interaction sequence and mine collaborative information from it. Although these studies have achieved certain results, existing studies tend to pay less attention to other rich information, such as item description, item label, user review, etc. In fact, this rich information can aid in learning the embedding representation of items and modeling user preferences. To tackle this issue, we propose A BERT model and momentum contrastive learning based sequential recommendation method named <strong>BertMoSRec</strong>. The BERT block uses the BERT model to learn the embedding representation of items in combination with item description, item label and user review, and then uses an embedding smoothing task to obtain the isotropic semantic representation. In the momentum contrastive learning block, we use a variety of data augmentation methods to maintain a large negative sample queue, which is used to compare and learn the user item interaction sequence, learn the embedding representation of the sequences, capture user preference information, and reduce the requirements for computing resources. Extensive experiments on multiple subsets of the Amazon dataset demonstrate the effectiveness of our proposed method.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102556"},"PeriodicalIF":2.7,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145980852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-05-01. Epub Date: 2026-01-20. DOI: 10.1016/j.datak.2026.102559
Nicolas Voisine, Lou-Anne Quellet, Marc Boullé, Fabrice Clérot, Anais Collin
Multi-table data is common in organizations, and its analysis is crucial for applications such as fraud detection, service improvement, and customer relationship management. Processing this type of data requires flattening, which transforms the multi-table structure into a single flat table by creating aggregates from the original variables. Several propositionalization tools aim to automate this process, but as data complexity increases due to the number of tables and relationships, the effectiveness of flattening decreases. To enhance the quality of propositionalization, it is essential to develop automated preprocessing systems that optimize the construction of aggregates by focusing on the most informative variables.
The objective of this article is to propose a method for selecting secondary variables and to demonstrate that this approach effectively filters out non-informative variables using a univariate analysis. Finally, we show, on a set of academic datasets, that reducing the number of secondary variables to only those that are truly informative can improve classification performance.
{"title":"Selection of secondary features from multi-table data for classification","authors":"Nicolas Voisine , Lou-Anne Quellet , Marc Boullé , Fabrice Clérot , Anais Collin","doi":"10.1016/j.datak.2026.102559","DOIUrl":"10.1016/j.datak.2026.102559","url":null,"abstract":"<div><div>Multi-table data is common in organizations, and its analysis is crucial for applications such as fraud detection, service improvement, and customer relationship management. Processing this type of data requires flattening, which transforms the multi-table structure into a single flat table by creating aggregates from the original variables. Several propositionalization tools aim to automate this process, but as data complexity increases due to the number of tables and relationships, the effectiveness of flattening decreases. To enhance the quality of propositionalization, it is essential to develop automated preprocessing systems that optimize the construction of aggregates by focusing on the most informative variables.</div><div>The objective of this article is to propose a method for selecting secondary variables and to demonstrate that this approach effectively filters out non-informative variables using a univariate analysis. Finally, we will show, using a set of academic datasets, that reducing the number of secondary variables to only those that are truly informative can improve classification performance.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102559"},"PeriodicalIF":2.7,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-05-01. Epub Date: 2026-01-08. DOI: 10.1016/j.datak.2026.102557
V. Backiyalakshmi, B. Umadevi
The banking sector plays a significant role in the economic growth of every nation, and most people hold accounts at several banks to transfer money at any time. The proliferation of online banking has brought a concerning rise in fraudulent transactions, posing a persistent challenge for fraud detection, which covers a range of fraudulent activities including insurance, credit card, and accounting fraud. Despite the numerous benefits of online transactions, financial fraud and unauthorized transactions pose significant risks. Researchers have developed various techniques in recent years to improve detection performance, yet handling massive amounts of heterogeneous client data to detect abnormal activities remains time-consuming. To resolve these issues, a new deep learning based approach is designed in this work. First, the data are gathered from benchmark databases and passed to a feature extraction phase, in which Principal Component Analysis (PCA), statistical features, and T-distributed Stochastic Neighbor Embedding (t-SNE) are used to extract informative features from the collected data, reducing noise and irrelevant information and thereby improving training speed. The extracted features are then combined, and optimal weighted fused features are determined using the Modified Random Value Reptile Search Algorithm (MRV-RSA), which further improves training speed and overall detection performance. The optimal weighted fused features are fed to the detection phase, which uses a Dilated Convolution Long Short Term Memory (ConvLSTM) with Multi-scale Dense Attention (DCL-MDA) technique that can handle massive, complex datasets without generalization problems and returns the detection result within a limited time. The efficiency of the model is validated with different metrics and contrasted against other traditional models; the suggested system exceeds the desired value for identifying fraudulent users, enhancing the security level in the banking sector. In the evaluation, the implemented framework attained an accuracy of 93.86% on Dataset 1 and 97.15% on Dataset 2, demonstrating its superior performance and its ability to detect fraud accurately at an earlier stage.
{"title":"MRV-RSA: Developed Modified Random Value Reptile Search Algorithm and Deep Learning based Fraud Detection Model in Banking Sector","authors":"V. Backiyalakshmi , B. Umadevi","doi":"10.1016/j.datak.2026.102557","DOIUrl":"10.1016/j.datak.2026.102557","url":null,"abstract":"<div><div>The banking sector is significant in economic growth in each nation. Also, each and every person has a separate account in diverse banks for effectively transmitting the money at any time. The proliferation of online banking has brought about a concerning rise in fraudulent transactions, posing a persistent challenge for fraud detection. This contains a collection of fraudulent activities, as well as insurance, credit card, and accounting fraud. Despite the numerous benefits of online transactions, the prevalence of financial fraud and unauthorized transactions poses significant risks. Several researchers have constantly developed various techniques in the past few years to improve detection performance. Yet, it takes more duration for handling massive amounts of various client data sizes to detect abnormal activities. With the aim of resolving these issues, a deep learning based new approach is designed in this research work. Initially, the prescribed data are gathered from the benchmark database, then the gathered data is given to the phase of feature extraction. In this phase, the Principal Component Analysis (PCA), statistical features, and T-distributed Stochastic Neighbor Embedding (t-SNE) mechanisms are utilized to effectively extract the informative features from the collected data. It can optimally minimize the noise and irrelevant information to enhance the training speed. Then, the extracted features are combined and the optimal weighted fused features are determined by utilizing the Modified Random Value Reptile Search Algorithm (MRV-RSA) optimization algorithm. It can effectively improve the training speed and overall performance enabling better detection. Also, the optimal weighted fused features are given to the detection phase using the Dilated Convolution Long Short Term Memory (ConvLSTM) with Multi-scale Dense Attention (DCL-MDA) technique. It can handle massive complex datasets without incurring generalization problems. Further, the classified detected result is provided with a limited duration. Therefore, the efficiency of the model is validated by using the different metrics and contrasted over other traditional models. Hence, the suggested system overwhelms the desired value for finding the fraudulent user to enhance the security level in the banking sector. From the evaluation process, the implemented framework has attained a reliable accuracy rate of 93.86% in Dataset 1 and 97.15% in Dataset 2 to prove its superior performance. This performance enhancement in the developed model could accurately detect fraud at an earlier stage.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102557"},"PeriodicalIF":2.7,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-05-01. Epub Date: 2026-01-29. DOI: 10.1016/j.datak.2026.102561
Nicolas Gutehrlé, Panggih Kusuma Ningrum, Iana Atanassova
Scientific uncertainty is inherent to the research process and to the production of new knowledge. In this paper, we present a large-scale analysis of how scientific uncertainty is expressed in research articles. To perform this study, we analyze the Const-L dataset, which consists of 31,849 research articles across 16 disciplines published over more than two decades. To identify and categorize uncertainty expressions, we employ the UnScientify annotation system, a linguistically informed, rule-based approach. We examine the distribution of uncertainty across disciplines, over time, and within the structure of articles, and we analyze its contexts and objects. The results show that the Social Sciences and Humanities (SSH) tend to have a higher frequency of uncertainty expressions than other fields. Overall, uncertainty tends to decrease over time, though this trend varies across disciplines. Moreover, correlations can be observed between the uncertainty expressions and both article structure and length. Finally, our findings provide new insights into scientific communication, by indicating distinctive disciplinary patterns in the ways uncertainty is expressed, as well as shared and field-specific research objects associated with uncertainty.
{"title":"A large-scale multi-disciplinary analysis of uncertainty in research articles","authors":"Nicolas Gutehrlé , Panggih Kusuma Ningrum , Iana Atanassova","doi":"10.1016/j.datak.2026.102561","DOIUrl":"10.1016/j.datak.2026.102561","url":null,"abstract":"<div><div>Scientific uncertainty is inherent to the research process and to the production of new knowledge. In this paper, we present a large-scale analysis of how scientific uncertainty is expressed in research articles. To perform this study, we analyze the Const-L dataset, which consists in 31,849 research articles across 16 disciplines published over more than two decades. To identify and categorize uncertainty expressions, we employ the UnScientify annotation system, a linguistically informed, rule-based approach. We examine the distribution of uncertainty across disciplines, over time, and within the structure of articles, and we analyze its contexts and objects. The results show that the Social Sciences and Humanities (SSH) tend to have a higher frequency of uncertainty expressions than other fields. Overall, uncertainty tends to decrease over time, though this trend varies across disciplines. Moreover, correlations can be observed between the uncertainty expressions and both article structure and length. Finally, our findings provide new insights into scientific communication, by indicating distinctive disciplinary patterns in the ways uncertainty is expressed, as well as shared and field-specific research objects associated with uncertainty.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102561"},"PeriodicalIF":2.7,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pattern sampling has emerged as a promising approach for information discovery in large databases, allowing analysts to focus on a manageable subset of patterns. In this approach, patterns are randomly drawn based on an interestingness measure, such as frequency or hyper-volume. This paper presents the first sampling approach designed to handle interval patterns in numerical databases. This approach, named Fips, samples interval patterns proportionally to their frequency. It uses a multi-step sampling procedure and addresses a key challenge in numerical data: accurately determining the number of interval patterns that cover each object. We extend this work with HFips, which samples interval patterns proportionally to both their frequency and hyper-volume. These methods efficiently tackle the well-known long-tail phenomenon in pattern sampling. We formally prove that Fips and HFips sample interval patterns in proportion to their frequency and the product of hyper-volume and frequency, respectively. Through experiments on several databases, we demonstrate the quality of the obtained patterns and their robustness against the long-tail phenomenon.
{"title":"Efficiently sampling interval patterns from numerical databases","authors":"Djawad Bekkoucha , Lamine Diop , Abdelkader Ouali , Bruno Crémilleux , Patrice Boizumault","doi":"10.1016/j.datak.2026.102566","DOIUrl":"10.1016/j.datak.2026.102566","url":null,"abstract":"<div><div>Pattern sampling has emerged as a promising approach for information discovery in large databases, allowing analysts to focus on a manageable subset of patterns. In this approach, patterns are randomly drawn based on an interestingness measure, such as frequency or hyper-volume. This paper presents the first sampling approach designed to handle interval patterns in numerical databases. This approach, named <span>Fips</span>, samples interval patterns proportionally to their frequency. It uses a multi-step sampling procedure and addresses a key challenge in numerical data: accurately determining the number of interval patterns that cover each object. We extend this work with <span>HFips</span>, which samples interval patterns proportionally to both their frequency and hyper-volume. These methods efficiently tackle the well-known long-tail phenomenon in pattern sampling. We formally prove that <span>Fips</span> and <span>HFips</span> sample interval patterns in proportion to their frequency and the product of hyper-volume and frequency, respectively. Through experiments on several databases, we demonstrate the quality of the obtained patterns and their robustness against the long-tail phenomenon.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102566"},"PeriodicalIF":2.7,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-05-01. Epub Date: 2026-01-22. DOI: 10.1016/j.datak.2026.102558
Antonis Bikakis, Aissatou Diallo, Luke Dickens, Anthony Hunter, Rob Miller
The ability to substitute one resource or tool for another is a common and important human ability. For example, in cooking, we often lack an ingredient for a recipe and we solve this problem by finding a substitute ingredient. There are various ways that we may reason about this. Often we need to draw on commonsense reasoning to find a substitute. For instance, we can think of the properties of the missing item and try to find similar items with similar properties. Despite the importance of substitution in human intelligence, there is a lack of a theoretical understanding of this faculty. To address this shortcoming, we propose a commonsense reasoning framework for conceptualizing and harnessing substitution. In order to ground our proposal, we focus on cooking, though we believe the proposal can be straightforwardly adapted to other applications that require a formalization of substitution. Our approach is to produce a general framework based on distance measures for determining similarity (e.g. between ingredients, or between processing steps), and on identifying inconsistencies between the logical representation of recipes and integrity constraints, which we use to flag the need for mitigation (e.g. after substituting one kind of pasta for another in a recipe, we may identify an inconsistency in the cooking time, which is resolved by updating the cooking time).
{"title":"A commonsense reasoning framework for substitution in cooking","authors":"Antonis Bikakis , Aissatou Diallo , Luke Dickens , Anthony Hunter , Rob Miller","doi":"10.1016/j.datak.2026.102558","DOIUrl":"10.1016/j.datak.2026.102558","url":null,"abstract":"<div><div>The ability to substitute some resource or tool for another is a common and important human ability. For example, in cooking, we often lack an ingredient for a recipe and we solve this problem by finding a substitute ingredient. There are various ways that we may reason about this. Often we need to draw on commonsense reasoning to find a substitute. For instance, we can think of the properties of the missing item, and try to find similar items with similar properties. Despite the importance of substitution in human intelligence, there is a lack of a theoretical understanding of the faculty. To address this shortcoming, we propose a commonsense reasoning framework for conceptualizing and harnessing substitution. In order to ground our proposal, we focus on cooking. Though we believe the proposal can be straightforwardly adapted to other applications that require formalization of substitution. Our approach is to produce a general framework based on distance measures for determining similarity (e.g. between ingredients, or between processing steps), and on identifying inconsistencies between the logical representation of recipes and integrity constraints that we use to flag the need for mitigation (e.g. after substituting one kind of pasta for another in a recipe, we may identify an inconsistency in the cooking time, and this is resolved by updating the cooking time).</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102558"},"PeriodicalIF":2.7,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-05-01. Epub Date: 2026-01-07. DOI: 10.1016/j.datak.2026.102554
Konstantinos Bougiatiotis, Georgios Paliouras
Multi-relational networks capture intricate relationships in data and have diverse applications across fields such as biomedical, financial, and social sciences. As networks derived from increasingly large datasets become more common, identifying efficient methods for representing and analyzing them becomes crucial. This work extends the Prime Adjacency Matrices (PAMs) framework, which employs prime numbers to uniquely represent the distinct relations within a network. This enables a compact representation of a complete multi-relational graph using a single adjacency matrix, which, in turn, facilitates quick computation of multi-hop adjacency matrices. In this work, we enhance the framework by introducing a lossless algorithm for calculating the multi-hop matrices and propose the Bag of Paths (BoP) representation, a versatile feature extraction methodology for various graph analytics tasks at the node, edge, and graph level. We demonstrate the efficiency of the framework across various tasks and datasets, showing that simple BoP-based models perform comparably to or better than commonly used neural models while improving speed by orders of magnitude.
{"title":"From primes to paths: Enabling fast multi-relational graph analysis","authors":"Konstantinos Bougiatiotis , Georgios Paliouras","doi":"10.1016/j.datak.2026.102554","DOIUrl":"10.1016/j.datak.2026.102554","url":null,"abstract":"<div><div>Multi-relational networks capture intricate relationships in data and have diverse applications across fields such as biomedical, financial, and social sciences. As networks derived from increasingly large datasets become more common, identifying efficient methods for representing and analyzing them becomes crucial. This work extends the Prime Adjacency Matrices (PAMs) framework, which employs prime numbers to represent distinct relations within a network uniquely. This enables a compact representation of a complete multi-relational graph using a single adjacency matrix, which, in turn, facilitates quick computation of multi-hop adjacency matrices. In this work, we enhance the framework by introducing a lossless algorithm for calculating the multi-hop matrices and propose the Bag of Paths (BoP) representation, a versatile feature extraction methodology for various graph analytics tasks, at the node, edge, and graph level. We demonstrate the efficiency of the framework across various tasks and datasets, showing that simple BoP-based models perform comparably to or better than commonly used neural models while improving speed by orders of magnitude.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102554"},"PeriodicalIF":2.7,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145941279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}