A large-scale multi-disciplinary analysis of uncertainty in research articles
Pub Date : 2026-01-29 DOI: 10.1016/j.datak.2026.102561
Nicolas Gutehrlé, Panggih Kusuma Ningrum, Iana Atanassova
Scientific uncertainty is inherent to the research process and to the production of new knowledge. In this paper, we present a large-scale analysis of how scientific uncertainty is expressed in research articles. To perform this study, we analyze the Const-L dataset, which consists of 31,849 research articles across 16 disciplines published over more than two decades. To identify and categorize uncertainty expressions, we employ the UnScientify annotation system, a linguistically informed, rule-based approach. We examine the distribution of uncertainty across disciplines, over time, and within the structure of articles, and we analyze its contexts and objects. The results show that the Social Sciences and Humanities (SSH) tend to have a higher frequency of uncertainty expressions than other fields. Overall, uncertainty tends to decrease over time, though this trend varies across disciplines. Moreover, correlations can be observed between uncertainty expressions and both article structure and length. Finally, our findings provide new insights into scientific communication by indicating distinctive disciplinary patterns in the ways uncertainty is expressed, as well as shared and field-specific research objects associated with uncertainty.
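The UnScientify system's actual rules are described in the paper and are not reproduced here; purely as an illustration of what a linguistically informed, rule-based uncertainty detector looks like, the minimal Python sketch below matches a small cue lexicon against sentences. The cue lists and category names are invented for the example.

```python
import re

# Illustrative cue lexicon (hypothetical; the real UnScientify rules are richer).
UNCERTAINTY_CUES = {
    "modal_verb": ["may", "might", "could"],
    "epistemic_verb": ["suggest", "appear", "seem", "indicate"],
    "approximator": ["approximately", "roughly", "about"],
    "attitude_marker": ["presumably", "possibly", "surprisingly"],
}

def detect_uncertainty(sentence: str):
    """Return (category, cue) pairs whose cue occurs as a whole word."""
    hits = []
    for category, cues in UNCERTAINTY_CUES.items():
        for cue in cues:
            if re.search(rf"\b{re.escape(cue)}\b", sentence, re.IGNORECASE):
                hits.append((category, cue))
    return hits

text = "These results suggest that the effect may be overestimated."
print(detect_uncertainty(text))
# [('modal_verb', 'may'), ('epistemic_verb', 'suggest')]
```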
{"title":"A large-scale multi-disciplinary analysis of uncertainty in research articles","authors":"Nicolas Gutehrlé , Panggih Kusuma Ningrum , Iana Atanassova","doi":"10.1016/j.datak.2026.102561","DOIUrl":"10.1016/j.datak.2026.102561","url":null,"abstract":"<div><div>Scientific uncertainty is inherent to the research process and to the production of new knowledge. In this paper, we present a large-scale analysis of how scientific uncertainty is expressed in research articles. To perform this study, we analyze the Const-L dataset, which consists in 31,849 research articles across 16 disciplines published over more than two decades. To identify and categorize uncertainty expressions, we employ the UnScientify annotation system, a linguistically informed, rule-based approach. We examine the distribution of uncertainty across disciplines, over time, and within the structure of articles, and we analyze its contexts and objects. The results show that the Social Sciences and Humanities (SSH) tend to have a higher frequency of uncertainty expressions than other fields. Overall, uncertainty tends to decrease over time, though this trend varies across disciplines. Moreover, correlations can be observed between the uncertainty expressions and both article structure and length. Finally, our findings provide new insights into scientific communication, by indicating distinctive disciplinary patterns in the ways uncertainty is expressed, as well as shared and field-specific research objects associated with uncertainty.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102561"},"PeriodicalIF":2.7,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficiently sampling interval patterns from numerical databases
Pub Date : 2026-01-29 DOI: 10.1016/j.datak.2026.102566
Djawad Bekkoucha, Lamine Diop, Abdelkader Ouali, Bruno Crémilleux, Patrice Boizumault
Pattern sampling has emerged as a promising approach for information discovery in large databases, allowing analysts to focus on a manageable subset of patterns. In this approach, patterns are randomly drawn based on an interestingness measure, such as frequency or hyper-volume. This paper presents the first sampling approach designed to handle interval patterns in numerical databases. This approach, named Fips, samples interval patterns proportionally to their frequency. It uses a multi-step sampling procedure and addresses a key challenge in numerical data: accurately determining the number of interval patterns that cover each object. We extend this work with HFips, which samples interval patterns proportionally to both their frequency and hyper-volume. These methods efficiently tackle the well-known long-tail phenomenon in pattern sampling. We formally prove that Fips and HFips sample interval patterns in proportion to their frequency and the product of hyper-volume and frequency, respectively. Through experiments on several databases, we demonstrate the quality of the obtained patterns and their robustness against the long-tail phenomenon.
Multi-treatment uplift evaluation on non-random assignment biased data
Pub Date : 2026-01-24 DOI: 10.1016/j.datak.2026.102565
Nathan Le Boudec, Nicolas Voisine, Bruno Crémilleux
Uplift quantifies the impact of an action (e.g., a marketing campaign or a medical treatment) on an individual's behavior. Uplift prediction is based on the assumption that the target and control groups are equivalent. However, in real-world scenarios, customers are often selected for actions based on their prior behavior, introducing a non-random assignment bias that distorts uplift estimation. This issue is even more pronounced in the multi-treatment case, as in offer recommendation systems, where multiple actions are possible for an individual. To the best of our knowledge, the effect of bias in multi-treatment uplift has not yet been studied. In this paper, we propose a novel protocol for evaluating multi-treatment uplift under non-random assignment bias. Using this protocol, we assess the performance of the main multi-treatment uplift methods from the literature. Our results show significant differences in their robustness to bias, providing valuable insights and guidelines for practical applications in biased settings.
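The paper's evaluation protocol is its central contribution and is not reproduced here; purely as an illustration of the problem setting, the sketch below simulates a non-random multi-treatment assignment driven by a hypothetical prior-behavior score. This is the kind of bias such a protocol must account for: the treated and control groups already differ before any action is taken.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_treatments = 10_000, 3            # treatments 0..2, control coded as -1

# Hypothetical "prior behavior" score the business uses to target customers.
prior_activity = rng.normal(size=n)

# Biased assignment: active customers are more likely to receive a treatment,
# which breaks the target/control equivalence that uplift models assume.
p_treated = 1 / (1 + np.exp(-prior_activity))     # sigmoid of prior behavior
treated = rng.random(n) < p_treated
treatment = np.where(treated, rng.integers(0, n_treatments, size=n), -1)

# Under this bias, the groups differ *before* any action:
print(prior_activity[treatment >= 0].mean())      # > 0 on average
print(prior_activity[treatment < 0].mean())       # < 0 on average
```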
{"title":"Multi-treatment uplift evaluation on non-random assignment biased data","authors":"Nathan Le Boudec , Nicolas Voisine , Bruno Crémilleux","doi":"10.1016/j.datak.2026.102565","DOIUrl":"10.1016/j.datak.2026.102565","url":null,"abstract":"<div><div>Uplift quantifies the impact of an action (marketing, medical treatment) on an individual’s behavior. Uplift prediction is based on the assumption that the target and control groups are equivalent. However, in real-world scenarios, customers are often selected for actions based on their prior behavior, introducing non-random assignment bias that distorts uplift estimation. This issue is even more present in the case of multi-treatment, as in the context of offer recommendation system, where multiple actions are possible for an individual. To the best of our knowledge, the effect of bias in multi-treatment uplift has not yet been studied. In this paper, we propose a novel protocol for evaluating multi-treatment uplift under non-random assignment bias. Using this protocol, we assess the performance of the main multi-treatment uplift methods from the literature. Our results show significant differences in their robustness to bias, providing valuable insights and guidelines for practical applications in biased settings.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102565"},"PeriodicalIF":2.7,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A commonsense reasoning framework for substitution in cooking
Pub Date : 2026-01-22 DOI: 10.1016/j.datak.2026.102558
Antonis Bikakis, Aissatou Diallo, Luke Dickens, Anthony Hunter, Rob Miller
The ability to substitute one resource or tool for another is a common and important human ability. For example, in cooking, we often lack an ingredient for a recipe, and we solve this problem by finding a substitute ingredient. There are various ways that we may reason about this. Often we need to draw on commonsense reasoning to find a substitute: for instance, we can think of the properties of the missing item and try to find similar items with similar properties. Despite the importance of substitution in human intelligence, there is a lack of theoretical understanding of this faculty. To address this shortcoming, we propose a commonsense reasoning framework for conceptualizing and harnessing substitution. To ground our proposal, we focus on cooking, though we believe the proposal can be straightforwardly adapted to other applications that require a formalization of substitution. Our approach is to produce a general framework based on distance measures for determining similarity (e.g., between ingredients, or between processing steps), and on identifying inconsistencies between the logical representation of recipes and integrity constraints, which we use to flag the need for mitigation (e.g., after substituting one kind of pasta for another in a recipe, we may identify an inconsistency in the cooking time, which is resolved by updating the cooking time).
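The framework itself is logical and far richer than this; as a toy sketch of the two ingredients the abstract names, distance-based similarity plus an integrity-constraint check, the Python below finds the nearest substitute in an invented feature space and flags the cooking-time mitigation from the abstract's pasta example. All feature values, times, and names are made up.

```python
import math

# Toy ingredient feature vectors (texture, sweetness, moisture); invented values.
FEATURES = {
    "penne":     (0.8, 0.1, 0.2),
    "spaghetti": (0.7, 0.1, 0.2),
    "rice":      (0.5, 0.2, 0.3),
}
COOK_TIME = {"penne": 11, "spaghetti": 9, "rice": 18}   # minutes, illustrative

def nearest_substitute(missing, pantry):
    """Pick the pantry item closest to the missing ingredient in feature space."""
    mv = FEATURES[missing]
    return min(pantry, key=lambda x: math.dist(FEATURES[x], mv))

def mitigation(recipe_time, substitute):
    """Integrity constraint: the recipe's step time must match the substitute."""
    if COOK_TIME[substitute] != recipe_time:
        return f"update cooking time {recipe_time} -> {COOK_TIME[substitute]} min"
    return "no change needed"

sub = nearest_substitute("penne", ["spaghetti", "rice"])
print(sub, "|", mitigation(recipe_time=11, substitute=sub))
# spaghetti | update cooking time 11 -> 9 min
```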
{"title":"A commonsense reasoning framework for substitution in cooking","authors":"Antonis Bikakis , Aissatou Diallo , Luke Dickens , Anthony Hunter , Rob Miller","doi":"10.1016/j.datak.2026.102558","DOIUrl":"10.1016/j.datak.2026.102558","url":null,"abstract":"<div><div>The ability to substitute some resource or tool for another is a common and important human ability. For example, in cooking, we often lack an ingredient for a recipe and we solve this problem by finding a substitute ingredient. There are various ways that we may reason about this. Often we need to draw on commonsense reasoning to find a substitute. For instance, we can think of the properties of the missing item, and try to find similar items with similar properties. Despite the importance of substitution in human intelligence, there is a lack of a theoretical understanding of the faculty. To address this shortcoming, we propose a commonsense reasoning framework for conceptualizing and harnessing substitution. In order to ground our proposal, we focus on cooking. Though we believe the proposal can be straightforwardly adapted to other applications that require formalization of substitution. Our approach is to produce a general framework based on distance measures for determining similarity (e.g. between ingredients, or between processing steps), and on identifying inconsistencies between the logical representation of recipes and integrity constraints that we use to flag the need for mitigation (e.g. after substituting one kind of pasta for another in a recipe, we may identify an inconsistency in the cooking time, and this is resolved by updating the cooking time).</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102558"},"PeriodicalIF":2.7,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A contextual hierarchical attention network for detecting mental health disorders using social media
Pub Date : 2026-01-22 DOI: 10.1016/j.datak.2026.102560
Ron Hochstenbach, Flavius Frasincar, Jasmijn Klinkhamer
Growing parts of the population suffer from mental health problems, and psychologists lack the capacity to diagnose, let alone treat, all those in need. Given recent advancements in the field, deep learning-based NLP techniques could help by detecting, from their written text, those who need support. To this end, this work improves the current state-of-the-art Hierarchical Attention Network (HAN) model by incorporating contextual awareness through BERT-based word embeddings and a multi-head self-attention user encoder, yielding the Context-HAN model. When trained and tested on the eRisk data sets on Self-Harm, Anorexia, and Depression, Context-HAN outperformed the HAN model across all data sets on various evaluation measures. Furthermore, we find and discuss interesting insights from an analysis of the attention scores, such as that longer and more recently written posts are more important for classification. This work shows the potential of attention mechanisms to leverage contextual information to improve the effectiveness of NLP methods at detecting mental health disorders from user-written text.
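The full Context-HAN is hierarchical (words, sentences, posts); the PyTorch sketch below is a guess at the spirit rather than the authors' code, and isolates only the user-encoder idea: multi-head self-attention over BERT-sized post embeddings, pooled into a single user vector for classification. All dimensions are assumptions.

```python
import torch
import torch.nn as nn

class UserEncoder(nn.Module):
    """Aggregate a user's post embeddings with multi-head self-attention,
    then mean-pool into a single user vector for classification."""
    def __init__(self, dim=768, heads=8, n_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, posts):                    # posts: (batch, n_posts, dim)
        ctx, _ = self.attn(posts, posts, posts)  # each post attends to all others
        user_vec = ctx.mean(dim=1)               # pool over the post dimension
        return self.classifier(user_vec)

posts = torch.randn(4, 10, 768)   # 4 users, 10 posts each, BERT-size embeddings
logits = UserEncoder()(posts)
print(logits.shape)               # torch.Size([4, 2])
```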
{"title":"A contextual hierarchical attention network for detecting mental health disorders using social media","authors":"Ron Hochstenbach, Flavius Frasincar, Jasmijn Klinkhamer","doi":"10.1016/j.datak.2026.102560","DOIUrl":"10.1016/j.datak.2026.102560","url":null,"abstract":"<div><div>Growing parts of the population suffer from mental health problems and psychologists lack capacity to diagnose, let alone treat, all those in need of it. Given recent advancements in the field, deep learning-based NLP techniques could help by detecting those in need of help based on their written text. To this end, this work improves the current state-of-the-art Hierarchical Attention Network (HAN) model by incorporating contextual awareness through BERT-based word embeddings and a multi-head self-attention user-encoder yielding the Context-HAN model. When trained and tested on the eRisk data sets on Self-Harm, Anorexia, and Depression, Context-HAN outperformed the HAN model across all data sets based on various evaluation measures. Furthermore, we find and discuss some interesting insights from analysis of the attention scores, such as that longer and more recently written posts are more important for classification. This work shows the potential of attention mechanisms to leverage contextual information to improve the effectiveness of NLP methods at detecting mental health disorders from user-written text.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102560"},"PeriodicalIF":2.7,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146039183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Selection of secondary features from multi-table data for classification
Pub Date : 2026-01-20 DOI: 10.1016/j.datak.2026.102559
Nicolas Voisine, Lou-Anne Quellet, Marc Boullé, Fabrice Clérot, Anais Collin
Multi-table data is common in organizations, and its analysis is crucial for applications such as fraud detection, service improvement, and customer relationship management. Processing this type of data requires flattening, which transforms the multi-table structure into a single flat table by creating aggregates from the original variables. Several propositionalization tools aim to automate this process, but as data complexity increases due to the number of tables and relationships, the effectiveness of flattening decreases. To enhance the quality of propositionalization, it is essential to develop automated preprocessing systems that optimize the construction of aggregates by focusing on the most informative variables.
The objective of this article is to propose a method for selecting secondary variables and to demonstrate, using a univariate analysis, that this approach effectively filters out non-informative variables. Finally, we show, on a set of academic datasets, that reducing the secondary variables to only those that are truly informative can improve classification performance.
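The paper's selection criterion is its own; assuming a simple mutual-information filter in its place, the sketch below shows the two steps the abstract describes: flattening a toy secondary table into aggregate columns, then keeping only aggregates that carry information about the target. The tables and column names are invented.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy two-table setting: one row per customer (main table) and many rows
# per customer in a secondary "calls" table. All values are illustrative.
main = pd.DataFrame({"customer": range(1, 9),
                     "churn":    [0, 1, 0, 1, 0, 1, 0, 1]})
calls = pd.DataFrame({
    "customer": [1, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8],
    "duration": [5, 7, 41, 4, 38, 45, 6, 39, 5, 44, 47],
})

# Flattening: one aggregate column per (secondary variable, function) pair.
aggs = calls.groupby("customer")["duration"].agg(["count", "mean", "max"])
flat = main.join(aggs, on="customer").fillna(0)

# Univariate filter: keep aggregates carrying information about the target.
X, y = flat[["count", "mean", "max"]], flat["churn"]
scores = mutual_info_classif(X, y, random_state=0)
kept = [c for c, s in zip(X.columns, scores) if s > 0]
print(dict(zip(X.columns, scores)), "kept:", kept)
```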
{"title":"Selection of secondary features from multi-table data for classification","authors":"Nicolas Voisine , Lou-Anne Quellet , Marc Boullé , Fabrice Clérot , Anais Collin","doi":"10.1016/j.datak.2026.102559","DOIUrl":"10.1016/j.datak.2026.102559","url":null,"abstract":"<div><div>Multi-table data is common in organizations, and its analysis is crucial for applications such as fraud detection, service improvement, and customer relationship management. Processing this type of data requires flattening, which transforms the multi-table structure into a single flat table by creating aggregates from the original variables. Several propositionalization tools aim to automate this process, but as data complexity increases due to the number of tables and relationships, the effectiveness of flattening decreases. To enhance the quality of propositionalization, it is essential to develop automated preprocessing systems that optimize the construction of aggregates by focusing on the most informative variables.</div><div>The objective of this article is to propose a method for selecting secondary variables and to demonstrate that this approach effectively filters out non-informative variables using a univariate analysis. Finally, we will show, using a set of academic datasets, that reducing the number of secondary variables to only those that are truly informative can improve classification performance.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102559"},"PeriodicalIF":2.7,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A BERT model and momentum contrastive learning based sequential recommendation method and its implementation
Pub Date : 2026-01-08 DOI: 10.1016/j.datak.2026.102556
Mingjun Xin, Ze He, Zhijun Xiao
Sequential recommendation is an important research direction in recommender systems. Previous sequential recommendation research has mainly focused on the user–item interaction sequence and mined collaborative information from it. Although these studies have achieved certain results, they tend to pay less attention to other rich information, such as item descriptions, item labels, and user reviews. In fact, this rich information can aid in learning the embedding representation of items and in modeling user preferences. To tackle this issue, we propose a sequential recommendation method based on a BERT model and momentum contrastive learning, named BertMoSRec. The BERT block uses the BERT model to learn the embedding representation of items in combination with item descriptions, item labels, and user reviews, and then uses an embedding smoothing task to obtain isotropic semantic representations. In the momentum contrastive learning block, we use a variety of data augmentation methods and maintain a large negative-sample queue, which is used for contrastive learning over user–item interaction sequences, learning the embedding representation of the sequences, capturing user preference information, and reducing the requirements for computing resources. Extensive experiments on multiple subsets of the Amazon dataset demonstrate the effectiveness of our proposed method.
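The abstract names two MoCo-style mechanics: a momentum-updated key encoder and a large negative-sample queue. The generic PyTorch sketch below shows those mechanics only, with a linear layer standing in for the sequence encoder and the queue's enqueue/dequeue step omitted; it is not the paper's implementation.

```python
import torch
import torch.nn as nn

dim, queue_size, m = 64, 1024, 0.999

encoder_q = nn.Linear(128, dim)                   # query encoder (trained)
encoder_k = nn.Linear(128, dim)                   # key encoder (momentum copy)
encoder_k.load_state_dict(encoder_q.state_dict())
queue = torch.randn(queue_size, dim)              # large negative-sample queue

@torch.no_grad()
def momentum_update():
    """Slowly track the query encoder: k = m*k + (1-m)*q."""
    for pk, pq in zip(encoder_k.parameters(), encoder_q.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)

def contrastive_logits(seq_aug1, seq_aug2):
    """Two augmented views of the same interaction sequence form the
    positive pair; the queue supplies negatives without a huge batch."""
    q = nn.functional.normalize(encoder_q(seq_aug1), dim=-1)
    with torch.no_grad():
        momentum_update()
        k = nn.functional.normalize(encoder_k(seq_aug2), dim=-1)
    l_pos = (q * k).sum(-1, keepdim=True)         # (batch, 1)
    l_neg = q @ queue.t()                         # (batch, queue_size)
    return torch.cat([l_pos, l_neg], dim=1)       # label 0 = the positive

logits = contrastive_logits(torch.randn(8, 128), torch.randn(8, 128))
print(logits.shape)                               # torch.Size([8, 1025])
```

The queue is what keeps memory requirements low: negatives are reused across batches instead of being recomputed, which matches the abstract's claim about reduced computing-resource demands.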
{"title":"A BERT model and momentum contrastive learning based sequential recommendation method and its implementation","authors":"Mingjun Xin, Ze He, Zhijun Xiao","doi":"10.1016/j.datak.2026.102556","DOIUrl":"10.1016/j.datak.2026.102556","url":null,"abstract":"<div><div>The sequential recommendation task is an important research direction in the recommendation system. Previous sequential recommendation researches mainly focus on the user–item interaction sequence and mine collaborative information from it. Although these studies have achieved certain results, existing studies tend to pay less attention to other rich information, such as item description, item label, user review, etc. In fact, this rich information can aid in learning the embedding representation of items and modeling user preferences. To tackle this issue, we propose A BERT model and momentum contrastive learning based sequential recommendation method named <strong>BertMoSRec</strong>. The BERT block uses the BERT model to learn the embedding representation of items in combination with item description, item label and user review, and then uses an embedding smoothing task to obtain the isotropic semantic representation. In the momentum contrastive learning block, we use a variety of data augmentation methods to maintain a large negative sample queue, which is used to compare and learn the user item interaction sequence, learn the embedding representation of the sequences, capture user preference information, and reduce the requirements for computing resources. Extensive experiments on multiple subsets of the Amazon dataset demonstrate the effectiveness of our proposed method.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102556"},"PeriodicalIF":2.7,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145980852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MRV-RSA: Developed Modified Random Value Reptile Search Algorithm and Deep Learning based Fraud Detection Model in Banking Sector
Pub Date : 2026-01-08 DOI: 10.1016/j.datak.2026.102557
V. Backiyalakshmi, B. Umadevi
The banking sector plays a significant role in the economic growth of every nation, and most people hold accounts at several banks so that they can transfer money at any time. The proliferation of online banking has brought a concerning rise in fraudulent transactions, posing a persistent challenge for fraud detection. Fraud takes many forms, including insurance, credit card, and accounting fraud. Despite the numerous benefits of online transactions, the prevalence of financial fraud and unauthorized transactions poses significant risks. Researchers have developed various techniques over the past few years to improve detection performance; yet handling massive amounts of diverse client data to detect abnormal activities remains time-consuming. To resolve these issues, a new deep learning based approach is designed in this work. Initially, the data are gathered from benchmark databases and passed to a feature extraction phase, where Principal Component Analysis (PCA), statistical features, and T-distributed Stochastic Neighbor Embedding (t-SNE) are used to extract informative features from the collected data. This reduces noise and irrelevant information and enhances training speed. The extracted features are then combined, and the optimal weighted fused features are determined using the Modified Random Value Reptile Search Algorithm (MRV-RSA), which improves training speed and overall performance, enabling better detection. The optimal weighted fused features are passed to the detection phase, which uses the Dilated Convolution Long Short Term Memory (ConvLSTM) with Multi-scale Dense Attention (DCL-MDA) technique. This can handle massive, complex datasets without incurring generalization problems, and the detection result is delivered within a limited time. The efficiency of the model is validated using different metrics and compared against traditional models; the suggested system exceeds the desired values for finding fraudulent users, enhancing the security level in the banking sector. In the evaluation, the implemented framework attained a reliable accuracy of 93.86% on Dataset 1 and 97.15% on Dataset 2, demonstrating its superior performance. This performance enhancement allows the developed model to accurately detect fraud at an earlier stage.
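MRV-RSA itself is not sketched here; assuming fixed placeholder weights where the optimizer would search, the toy example below shows only the surrounding pipeline step: extracting PCA and statistical features and fusing them under per-block weights (t-SNE is omitted for brevity). All shapes and values are invented.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))                 # toy transaction features

# Feature extraction: PCA components plus simple per-row statistics.
pca_feats = PCA(n_components=5).fit_transform(X)
stat_feats = np.column_stack([X.mean(1), X.std(1), X.min(1), X.max(1)])

# Weighted fusion: in the paper these weights are searched by MRV-RSA;
# here they are fixed placeholders purely for illustration.
w_pca, w_stat = 0.7, 0.3
fused = np.hstack([w_pca * pca_feats, w_stat * stat_feats])
print(fused.shape)                             # (200, 9), input to the detector
```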
{"title":"MRV-RSA: Developed Modified Random Value Reptile Search Algorithm and Deep Learning based Fraud Detection Model in Banking Sector","authors":"V. Backiyalakshmi , B. Umadevi","doi":"10.1016/j.datak.2026.102557","DOIUrl":"10.1016/j.datak.2026.102557","url":null,"abstract":"<div><div>The banking sector is significant in economic growth in each nation. Also, each and every person has a separate account in diverse banks for effectively transmitting the money at any time. The proliferation of online banking has brought about a concerning rise in fraudulent transactions, posing a persistent challenge for fraud detection. This contains a collection of fraudulent activities, as well as insurance, credit card, and accounting fraud. Despite the numerous benefits of online transactions, the prevalence of financial fraud and unauthorized transactions poses significant risks. Several researchers have constantly developed various techniques in the past few years to improve detection performance. Yet, it takes more duration for handling massive amounts of various client data sizes to detect abnormal activities. With the aim of resolving these issues, a deep learning based new approach is designed in this research work. Initially, the prescribed data are gathered from the benchmark database, then the gathered data is given to the phase of feature extraction. In this phase, the Principal Component Analysis (PCA), statistical features, and T-distributed Stochastic Neighbor Embedding (t-SNE) mechanisms are utilized to effectively extract the informative features from the collected data. It can optimally minimize the noise and irrelevant information to enhance the training speed. Then, the extracted features are combined and the optimal weighted fused features are determined by utilizing the Modified Random Value Reptile Search Algorithm (MRV-RSA) optimization algorithm. It can effectively improve the training speed and overall performance enabling better detection. Also, the optimal weighted fused features are given to the detection phase using the Dilated Convolution Long Short Term Memory (ConvLSTM) with Multi-scale Dense Attention (DCL-MDA) technique. It can handle massive complex datasets without incurring generalization problems. Further, the classified detected result is provided with a limited duration. Therefore, the efficiency of the model is validated by using the different metrics and contrasted over other traditional models. Hence, the suggested system overwhelms the desired value for finding the fraudulent user to enhance the security level in the banking sector. From the evaluation process, the implemented framework has attained a reliable accuracy rate of 93.86% in Dataset 1 and 97.15% in Dataset 2 to prove its superior performance. This performance enhancement in the developed model could accurately detect fraud at an earlier stage.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102557"},"PeriodicalIF":2.7,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The tendency-based multi-criteria group recommendation systems
Pub Date : 2026-01-07 DOI: 10.1016/j.datak.2025.102553
Tugba Turkoglu Kaya
Aggregation strategies in group recommender systems often fall short in balancing diverse user preferences and ensuring fair satisfaction within the group. These limitations become more pronounced in single-criteria frameworks, where the multidimensional nature of user–item interactions is overlooked, restricting the system's capacity to capture subtle preference variations. While multi-criteria recommendation offers a promising solution by incorporating multiple evaluation dimensions, the adaptation of single-criteria aggregation mechanisms to a multi-criteria setting remains an open research question. To this end, this study develops new aggregation techniques and a top-n recommendation mechanism for a new multi-criteria group recommendation system. The new combining techniques, called weighted preference aggregation, preference without weighted aggregation, and weighted without preference vector aggregation, take user tendencies and the qualitative ordering of user evaluations into account, while the newly developed top-n recommender prepares the recommendation list according to group tendencies using product characteristic structures. In experiments on two data sets (Yahoo!Movies, TripAdvisor) and three group sizes (1%, 5%, 10%), each of the proposed methods is compared with methods from the literature. The results show that the proposed methods are highly successful.
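The three aggregation variants are not fully specified in the abstract; as a generic illustration of weighted multi-criteria group aggregation, the sketch below collapses each member's criterion ratings with hypothetical criterion weights and then averages across the group. Ratings, criteria, and weights are all invented.

```python
import numpy as np

# ratings[member, criterion] for one item; criteria = (story, acting, visuals).
ratings = np.array([[4, 5, 3],       # member A
                    [2, 4, 4],       # member B
                    [5, 3, 4]])      # member C
criterion_weights = np.array([0.5, 0.3, 0.2])   # hypothetical importance weights

# Weighted aggregation: collapse criteria per member, then combine members.
per_member = ratings @ criterion_weights          # each member's overall score
group_score = per_member.mean()                   # average-strategy group score
print(per_member.round(2), group_score.round(2))  # [4.1 3.  4.2] 3.77
```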
{"title":"The tendency-based multi-criteria group recommendation systems","authors":"Tugba Turkoglu Kaya","doi":"10.1016/j.datak.2025.102553","DOIUrl":"10.1016/j.datak.2025.102553","url":null,"abstract":"<div><div>Aggregation strategies in group recommender systems often fall short in balancing diverse user preferences and ensuring fair satisfaction within the group. These limitations become more pronounced in single-criteria frameworks, where the multidimensional nature of user–item interactions is overlooked, thereby restricting the system’s capacity to capture subtle preference variations. While multi-criteria recommendation offers a promising solution by incorporating multiple evaluation dimensions, the adaptation of single-criteria aggregation mechanisms to a multi-criteria setting remains an open research question. For the purpose, in the study, new aggregation techniques and top-<em>n</em> recommendation system mechanism are developed for a new multi-criteria group recommendation system. While user tendencies and qualitative sequences of user evaluations are taken into account in the new combining techniques called weighted preference aggregation, preference without weighted aggregation and weighted without preference vector aggregation the newly developed top-<em>n</em> recommendation system aims to prepare a recommendation list according to group tendencies by using product characteristic structures. In the studies carried out on two different data sets (Yahoo!Movies, TripAdvisor) for three group size (1, 5, 10%), a comparative analysis of each of the proposed methods is made with the methods available in the literature. When the results are examined, it is seen that the proposed methods give very successful results.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"162 ","pages":"Article 102553"},"PeriodicalIF":2.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From primes to paths: Enabling fast multi-relational graph analysis
Pub Date : 2026-01-07 DOI: 10.1016/j.datak.2026.102554
Konstantinos Bougiatiotis, Georgios Paliouras
Multi-relational networks capture intricate relationships in data and have diverse applications across fields such as the biomedical, financial, and social sciences. As networks derived from increasingly large datasets become more common, identifying efficient methods for representing and analyzing them becomes crucial. This work extends the Prime Adjacency Matrices (PAMs) framework, which employs prime numbers to uniquely represent the distinct relations within a network. This enables a compact representation of a complete multi-relational graph using a single adjacency matrix, which, in turn, facilitates quick computation of multi-hop adjacency matrices. In this work, we enhance the framework by introducing a lossless algorithm for calculating the multi-hop matrices, and we propose the Bag of Paths (BoP) representation, a versatile feature-extraction methodology for graph analytics tasks at the node, edge, and graph levels. We demonstrate the efficiency of the framework across various tasks and datasets, showing that simple BoP-based models perform comparably to or better than commonly used neural models while improving speed by orders of magnitude.
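Based on the description above, a minimal PAM can be reconstructed directly: assign each relation a distinct prime, store it in a single adjacency matrix, and read multi-hop relation chains off the prime factorization of entries of matrix powers. The toy graph below is invented, and this sketch ignores the ordering of relations along a path and factor collisions, which the framework's lossless algorithm is designed to handle.

```python
import numpy as np
from sympy import factorint

relations = {"works_at": 2, "located_in": 3}     # one distinct prime per relation

# Nodes: 0=alice, 1=acme, 2=paris. A single matrix holds all relations at once.
n = 3
A = np.zeros((n, n), dtype=np.int64)
A[0, 1] = relations["works_at"]                  # alice -works_at-> acme
A[1, 2] = relations["located_in"]                # acme -located_in-> paris

# Powers of A multiply primes along paths: a nonzero entry of A^2 encodes a
# 2-hop chain, and its prime factorization names the relations traversed.
A2 = A @ A
print(A2[0, 2])                    # 6 = 2 * 3
print(factorint(int(A2[0, 2])))    # {2: 1, 3: 1} -> works_at then located_in
```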
{"title":"From primes to paths: Enabling fast multi-relational graph analysis","authors":"Konstantinos Bougiatiotis , Georgios Paliouras","doi":"10.1016/j.datak.2026.102554","DOIUrl":"10.1016/j.datak.2026.102554","url":null,"abstract":"<div><div>Multi-relational networks capture intricate relationships in data and have diverse applications across fields such as biomedical, financial, and social sciences. As networks derived from increasingly large datasets become more common, identifying efficient methods for representing and analyzing them becomes crucial. This work extends the Prime Adjacency Matrices (PAMs) framework, which employs prime numbers to represent distinct relations within a network uniquely. This enables a compact representation of a complete multi-relational graph using a single adjacency matrix, which, in turn, facilitates quick computation of multi-hop adjacency matrices. In this work, we enhance the framework by introducing a lossless algorithm for calculating the multi-hop matrices and propose the Bag of Paths (BoP) representation, a versatile feature extraction methodology for various graph analytics tasks, at the node, edge, and graph level. We demonstrate the efficiency of the framework across various tasks and datasets, showing that simple BoP-based models perform comparably to or better than commonly used neural models while improving speed by orders of magnitude.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"163 ","pages":"Article 102554"},"PeriodicalIF":2.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145941279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}