
Data & Knowledge Engineering: Latest Articles

A large-scale multi-disciplinary analysis of uncertainty in research articles
IF 2.7 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-29 | DOI: 10.1016/j.datak.2026.102561
Nicolas Gutehrlé, Panggih Kusuma Ningrum, Iana Atanassova
Scientific uncertainty is inherent to the research process and to the production of new knowledge. In this paper, we present a large-scale analysis of how scientific uncertainty is expressed in research articles. To perform this study, we analyze the Const-L dataset, which consists of 31,849 research articles across 16 disciplines published over more than two decades. To identify and categorize uncertainty expressions, we employ the UnScientify annotation system, a linguistically informed, rule-based approach. We examine the distribution of uncertainty across disciplines, over time, and within the structure of articles, and we analyze its contexts and objects. The results show that the Social Sciences and Humanities (SSH) tend to have a higher frequency of uncertainty expressions than other fields. Overall, uncertainty tends to decrease over time, though this trend varies across disciplines. Moreover, correlations can be observed between the uncertainty expressions and both article structure and length. Finally, our findings provide new insights into scientific communication by indicating distinctive disciplinary patterns in the ways uncertainty is expressed, as well as shared and field-specific research objects associated with uncertainty.
Data & Knowledge Engineering, Volume 163, Article 102561
Citations: 0
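To make the idea of rule-based uncertainty detection in the abstract above concrete, here is a minimal sketch. It is not the UnScientify system: the cue list, the sentence splitter, and the function names are all hypothetical, chosen only to show how a lexical rule can flag sentences and yield a per-text frequency.

```python
import re

# Hypothetical cue list for illustration only; these are NOT the UnScientify rules.
UNCERTAINTY_CUES = [
    "may", "might", "could", "possibly", "perhaps",
    "suggest", "suggests", "appear", "appears", "unclear", "likely",
]
CUE_PATTERN = re.compile(r"\b(" + "|".join(UNCERTAINTY_CUES) + r")\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def uncertainty_rate(text):
    """Return the share of sentences that contain at least one cue."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    flagged = sum(1 for s in sentences if CUE_PATTERN.search(s))
    return flagged / len(sentences)


sample = "The effect may be small. We measured it twice. Results suggest a trend."
print(uncertainty_rate(sample))
```

A real annotation system would also need to handle negation, scope, and context, which a flat cue list cannot capture.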
Efficiently sampling interval patterns from numerical databases
IF 2.7 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-29 | DOI: 10.1016/j.datak.2026.102566
Djawad Bekkoucha, Lamine Diop, Abdelkader Ouali, Bruno Crémilleux, Patrice Boizumault
Pattern sampling has emerged as a promising approach for information discovery in large databases, allowing analysts to focus on a manageable subset of patterns. In this approach, patterns are randomly drawn based on an interestingness measure, such as frequency or hyper-volume. This paper presents the first sampling approach designed to handle interval patterns in numerical databases. This approach, named Fips, samples interval patterns proportionally to their frequency. It uses a multi-step sampling procedure and addresses a key challenge in numerical data: accurately determining the number of interval patterns that cover each object. We extend this work with HFips, which samples interval patterns proportionally to both their frequency and hyper-volume. These methods efficiently tackle the well-known long-tail phenomenon in pattern sampling. We formally prove that Fips and HFips sample interval patterns in proportion to their frequency and the product of hyper-volume and frequency, respectively. Through experiments on several databases, we demonstrate the quality of the obtained patterns and their robustness against the long-tail phenomenon.
Data & Knowledge Engineering, Volume 163, Article 102566
Citations: 0
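The abstract above mentions a multi-step procedure that relies on counting the interval patterns covering each object. The following is a minimal sketch of a generic two-step, frequency-proportional sampler and should not be read as the authors' Fips algorithm. It assumes that interval bounds are taken from the distinct values observed for each attribute; the function names are hypothetical. Under that assumption, an object whose value has rank r among k distinct values is covered by (r + 1)(k - r) intervals on that attribute, and the counts multiply across attributes.

```python
import bisect
import random


def build_domains(db):
    """Sorted distinct values of each attribute (column)."""
    return [sorted(set(col)) for col in zip(*db)]


def cover_count(obj, domains):
    """Number of interval patterns (one [lo, hi] per attribute) covering obj."""
    count = 1
    for value, dom in zip(obj, domains):
        rank = bisect.bisect_left(dom, value)    # index of value in its domain
        count *= (rank + 1) * (len(dom) - rank)  # choices for lo times choices for hi
    return count


def sample_pattern(db, rng=random):
    """Draw one interval pattern with probability proportional to its frequency."""
    domains = build_domains(db)
    weights = [cover_count(obj, domains) for obj in db]
    # Step 1: pick an object proportionally to the number of patterns covering it.
    obj = rng.choices(db, weights=weights, k=1)[0]
    # Step 2: pick uniformly among the patterns that cover this object.
    pattern = []
    for value, dom in zip(obj, domains):
        rank = bisect.bisect_left(dom, value)
        lo = dom[rng.randint(0, rank)]
        hi = dom[rng.randint(rank, len(dom) - 1)]
        pattern.append((lo, hi))
    return tuple(pattern)


db = [(1.0, 10.0), (2.0, 20.0), (3.0, 10.0)]
print(sample_pattern(db, random.Random(0)))
```

Each pattern is reached once through every object it covers, with the object weight cancelling the uniform choice in step 2, so the overall probability of a pattern is proportional to the number of objects it covers.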
Multi-treatment uplift evaluation on non-random assignment biased data
IF 2.7 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-24 | DOI: 10.1016/j.datak.2026.102565
Nathan Le Boudec, Nicolas Voisine, Bruno Crémilleux
Uplift quantifies the impact of an action (marketing, medical treatment) on an individual’s behavior. Uplift prediction is based on the assumption that the target and control groups are equivalent. However, in real-world scenarios, customers are often selected for actions based on their prior behavior, introducing non-random assignment bias that distorts uplift estimation. This issue is even more pronounced in the multi-treatment case, as in offer recommendation systems, where multiple actions are possible for an individual. To the best of our knowledge, the effect of bias in multi-treatment uplift has not yet been studied. In this paper, we propose a novel protocol for evaluating multi-treatment uplift under non-random assignment bias. Using this protocol, we assess the performance of the main multi-treatment uplift methods from the literature. Our results show significant differences in their robustness to bias, providing valuable insights and guidelines for practical applications in biased settings.
Data & Knowledge Engineering, Volume 163, Article 102565
Citations: 0
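As a reference point for the abstract above, the simplest uplift estimate is a difference in response rates between a treated group and the control group, computed once per treatment in the multi-treatment case. The sketch below shows only this naive estimator, not the evaluation protocol proposed in the paper; the record layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def uplift_by_treatment(records, control="control"):
    """Naive uplift: response rate of each treatment minus that of the control.

    records: iterable of (treatment_label, outcome) pairs with outcome in {0, 1}.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for treatment, outcome in records:
        totals[treatment] += 1
        positives[treatment] += outcome
    if totals[control] == 0:
        raise ValueError("no control observations")
    base = positives[control] / totals[control]
    return {t: positives[t] / totals[t] - base for t in totals if t != control}


records = [
    ("control", 0), ("control", 1),
    ("A", 1), ("A", 1),
    ("B", 0), ("B", 1),
]
print(uplift_by_treatment(records))
```

This estimator is only meaningful when the groups are comparable. If individuals were assigned to a treatment because of their prior behavior, the difference mixes the effect of the action with the selection effect, which is the bias the abstract refers to.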
A commonsense reasoning framework for substitution in cooking
IF 2.7 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-22 | DOI: 10.1016/j.datak.2026.102558
Antonis Bikakis, Aissatou Diallo, Luke Dickens, Anthony Hunter, Rob Miller
The ability to substitute some resource or tool for another is a common and important human ability. For example, in cooking, we often lack an ingredient for a recipe and solve this problem by finding a substitute ingredient. There are various ways in which we may reason about this. Often we need to draw on commonsense reasoning to find a substitute. For instance, we can think of the properties of the missing item and try to find similar items with similar properties. Despite the importance of substitution in human intelligence, a theoretical understanding of this faculty is lacking. To address this shortcoming, we propose a commonsense reasoning framework for conceptualizing and harnessing substitution. To ground our proposal, we focus on cooking, though we believe the proposal can be straightforwardly adapted to other applications that require formalization of substitution. Our approach is to produce a general framework based on distance measures for determining similarity (e.g. between ingredients, or between processing steps), and on identifying inconsistencies between the logical representation of recipes and integrity constraints, which we use to flag the need for mitigation (e.g. after substituting one kind of pasta for another in a recipe, we may identify an inconsistency in the cooking time, which is resolved by updating the cooking time).
Data & Knowledge Engineering, Volume 163, Article 102558
Citations: 0
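The two ingredients of the framework described above, a distance measure for finding a substitute and an integrity constraint that triggers a repair, can be illustrated with a toy sketch. This is not the authors' formalism: the property dictionaries, the mismatch-based distance, and the cooking-time rule are all hypothetical and stand in for the logical representation mentioned in the abstract.

```python
# Hypothetical ingredient properties, invented for illustration.
INGREDIENTS = {
    "spaghetti": {"type": "pasta", "shape": "long", "cook_minutes": 10},
    "linguine": {"type": "pasta", "shape": "long", "cook_minutes": 11},
    "rice": {"type": "grain", "shape": "short", "cook_minutes": 18},
}


def distance(a, b):
    """Share of properties on which two ingredients differ (0 means identical)."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(a.get(k) != b.get(k) for k in keys) / len(keys)


def best_substitute(missing, candidates, ingredients):
    """Return the candidate whose properties are closest to the missing item."""
    return min(candidates, key=lambda c: distance(ingredients[missing], ingredients[c]))


def repair_cooking_time(step, ingredient, ingredients):
    """Apply the substitution, then restore the constraint
    'a boiling step lasts as long as the ingredient's cook time'."""
    required = ingredients[ingredient]["cook_minutes"]
    return {**step, "ingredient": ingredient, "minutes": required}


step = {"action": "boil", "ingredient": "spaghetti", "minutes": 10}
substitute = best_substitute("spaghetti", ["linguine", "rice"], INGREDIENTS)
print(substitute, repair_cooking_time(step, substitute, INGREDIENTS))
```

The pasta example mirrors the one in the abstract: the substitution itself is chosen by similarity, and the inconsistent cooking time is then resolved by updating it.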
A contextual hierarchical attention network for detecting mental health disorders using social media
IF 2.7 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-22 | DOI: 10.1016/j.datak.2026.102560
Ron Hochstenbach, Flavius Frasincar, Jasmijn Klinkhamer
Growing parts of the population suffer from mental health problems and psychologists lack capacity to diagnose, let alone treat, all those in need of it. Given recent advancements in the field, deep learning-based NLP techniques could help by detecting those in need of help based on their written text. To this end, this work improves the current state-of-the-art Hierarchical Attention Network (HAN) model by incorporating contextual awareness through BERT-based word embeddings and a multi-head self-attention user-encoder yielding the Context-HAN model. When trained and tested on the eRisk data sets on Self-Harm, Anorexia, and Depression, Context-HAN outperformed the HAN model across all data sets based on various evaluation measures. Furthermore, we find and discuss some interesting insights from analysis of the attention scores, such as that longer and more recently written posts are more important for classification. This work shows the potential of attention mechanisms to leverage contextual information to improve the effectiveness of NLP methods at detecting mental health disorders from user-written text.
Data & Knowledge Engineering, Volume 163, Article 102560
Citations: 0
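The attention scores discussed in the abstract above come from attention pooling, where each post (or word) vector receives a weight and the weighted vectors are summed into one representation. The sketch below is a simplified, dependency-free version of that step and is not the Context-HAN architecture: it scores vectors by a plain dot product with a context vector, whereas hierarchical attention networks usually apply a learned projection first. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Weight each vector by softmax(dot(vector, context)) and sum them."""
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(context)
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


posts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pooled, weights = attention_pool(posts, context=[2.0, 0.0])
print(weights)
```

Inspecting `weights` is what allows the kind of analysis mentioned in the abstract, such as checking which posts contribute most to a classification.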
Selection of secondary features from multi-table data for classification
IF 2.7 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-20 | DOI: 10.1016/j.datak.2026.102559
Nicolas Voisine, Lou-Anne Quellet, Marc Boullé, Fabrice Clérot, Anais Collin
Multi-table data is common in organizations, and its analysis is crucial for applications such as fraud detection, service improvement, and customer relationship management. Processing this type of data requires flattening, which transforms the multi-table structure into a single flat table by creating aggregates from the original variables. Several propositionalization tools aim to automate this process, but as data complexity increases due to the number of tables and relationships, the effectiveness of flattening decreases. To enhance the quality of propositionalization, it is essential to develop automated preprocessing systems that optimize the construction of aggregates by focusing on the most informative variables.
The objective of this article is to propose a method for selecting secondary variables and to demonstrate that this approach effectively filters out non-informative variables using a univariate analysis. Finally, we will show, using a set of academic datasets, that reducing the number of secondary variables to only those that are truly informative can improve classification performance.
Data & Knowledge Engineering, Volume 163, Article 102559
Citations: 0
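Flattening, as described in the abstract above, turns a main table and its secondary tables into one flat table by computing aggregates of secondary variables per main-table row. The sketch below shows this for a single secondary variable. The table layout, column names, and choice of aggregates are assumptions for illustration and do not reproduce any particular propositionalization tool or the selection method proposed in the paper.

```python
from collections import defaultdict
from statistics import mean


def flatten(customers, orders, key="customer_id", variable="amount"):
    """Add count/mean/max aggregates of one secondary variable to each main-table row."""
    groups = defaultdict(list)
    for row in orders:
        groups[row[key]].append(row[variable])
    flat = []
    for cust in customers:
        values = groups.get(cust[key], [])
        flat.append({
            **cust,
            f"{variable}_count": len(values),
            f"{variable}_mean": mean(values) if values else None,
            f"{variable}_max": max(values) if values else None,
        })
    return flat


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
]
for row in flatten(customers, orders):
    print(row)
```

Every secondary variable multiplies the number of aggregates generated, which is why filtering out non-informative secondary variables before flattening, as the abstract proposes, keeps the flat table manageable.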
A BERT model and momentum contrastive learning based sequential recommendation method and its implementation
IF 2.7 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-08 | DOI: 10.1016/j.datak.2026.102556
Mingjun Xin, Ze He, Zhijun Xiao
Sequential recommendation is an important research direction in recommendation systems. Previous research on sequential recommendation mainly focuses on the user–item interaction sequence and mines collaborative information from it. Although these studies have achieved certain results, they tend to pay less attention to other rich information, such as item descriptions, item labels, and user reviews. In fact, this rich information can aid in learning the embedding representation of items and in modeling user preferences. To tackle this issue, we propose BertMoSRec, a sequential recommendation method based on a BERT model and momentum contrastive learning. The BERT block uses the BERT model to learn the embedding representation of items from item descriptions, item labels, and user reviews, and then uses an embedding smoothing task to obtain an isotropic semantic representation. In the momentum contrastive learning block, we use a variety of data augmentation methods and maintain a large queue of negative samples, which is used to contrast user–item interaction sequences, learn embedding representations of the sequences, capture user preference information, and reduce the demand for computing resources. Extensive experiments on multiple subsets of the Amazon dataset demonstrate the effectiveness of our proposed method.
Data & Knowledge Engineering, Volume 163, Article 102556
Citations: 0
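Momentum contrastive learning, as referred to in the abstract above, generally rests on two mechanisms: a key encoder updated as an exponential moving average of the query encoder, and a fixed-size queue of past key embeddings reused as negatives. The sketch below shows both in plain Python, with parameters and embeddings reduced to lists of floats. It is a generic illustration with hypothetical names and an assumed momentum value, not the BertMoSRec implementation.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class NegativeQueue:
    """Fixed-size FIFO of key embeddings that serve as negative samples."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def enqueue(self, embeddings):
        # Once the queue is full, the oldest entries drop out automatically.
        self.items.extend(embeddings)

    def negatives(self):
        return list(self.items)


key = momentum_update([1.0, 1.0], [0.0, 2.0], m=0.9)
queue = NegativeQueue(max_size=3)
queue.enqueue([[0.1], [0.2]])
queue.enqueue([[0.3], [0.4]])
print(key, queue.negatives())
```

Because negatives are read from the queue rather than recomputed, the number of negatives can be much larger than the batch size, which is consistent with the reduced computing requirements mentioned in the abstract.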
MRV-RSA: Developed Modified Random Value Reptile Search Algorithm and Deep Learning based Fraud Detection Model in Banking Sector
IF 2.7 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-08 | DOI: 10.1016/j.datak.2026.102557
V. Backiyalakshmi, B. Umadevi
The banking sector plays a significant role in the economic growth of every nation, and most people hold accounts at one or more banks so that they can transfer money at any time. The proliferation of online banking has brought a concerning rise in fraudulent transactions, posing a persistent challenge for fraud detection. Such fraud covers a range of activities, including insurance, credit card, and accounting fraud. Despite the numerous benefits of online transactions, the prevalence of financial fraud and unauthorized transactions poses significant risks. Researchers have developed various techniques over the past few years to improve detection performance, yet handling massive volumes of client data of varying sizes to detect abnormal activity still takes a long time. To address these issues, this work designs a new deep learning based approach. First, the data are gathered from the benchmark database and passed to the feature extraction phase. In this phase, Principal Component Analysis (PCA), statistical features, and t-distributed Stochastic Neighbor Embedding (t-SNE) are used to extract informative features from the collected data, which reduces noise and irrelevant information and speeds up training. The extracted features are then combined, and the optimal weighted fused features are determined with the Modified Random Value Reptile Search Algorithm (MRV-RSA), which improves training speed and overall performance and thus enables better detection. The optimal weighted fused features are fed to the detection phase, which uses Dilated Convolution Long Short Term Memory (ConvLSTM) with Multi-scale Dense Attention (DCL-MDA). This technique can handle massive, complex datasets without incurring generalization problems, and it delivers the classification result within a limited time. The efficiency of the model is validated with several metrics and compared against other traditional models. The suggested system exceeds the desired performance in identifying fraudulent users, enhancing the security level in the banking sector. In the evaluation, the implemented framework attained an accuracy of 93.86% on Dataset 1 and 97.15% on Dataset 2, demonstrating its superior performance. This performance gain suggests that the developed model could detect fraud accurately at an earlier stage.
Data & Knowledge Engineering, Volume 163, Article 102557
Citations: 0
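The weighted feature fusion step described in the MRV-RSA abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how features are fused, so "fusion" is assumed here to mean scaling each feature set by a weight and concatenating. The function names, the sample values, and the fixed weights are all hypothetical; in the paper the weights are chosen by the MRV-RSA optimizer, and the other two feature sets come from PCA and t-SNE.

```python
from statistics import mean, pstdev


def statistical_features(amounts):
    """Summary statistics for one account's transaction amounts."""
    return [mean(amounts), pstdev(amounts), min(amounts), max(amounts)]


def weighted_fusion(feature_sets, weights):
    """Scale each feature set by its weight, then concatenate the results.

    Assumption: this scale-and-concatenate rule is one plausible reading of
    "weighted fused features"; the paper may define fusion differently.
    """
    if len(feature_sets) != len(weights):
        raise ValueError("one weight per feature set is required")
    fused = []
    for features, weight in zip(feature_sets, weights):
        fused.extend(weight * value for value in features)
    return fused


# Hypothetical placeholders standing in for PCA and t-SNE outputs.
stats = statistical_features([10.0, 20.0, 30.0, 40.0])
pca_like = [0.5, -1.0]
tsne_like = [2.0, 4.0]

# Hypothetical fixed weights; an optimizer would search for these instead.
fused = weighted_fusion([stats, pca_like, tsne_like], [1.0, 2.0, 0.5])
```

The fused vector has eight entries here: four weighted statistics followed by the two scaled placeholder sets. A downstream detector would consume this vector in place of the raw features.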
The tendency-based multi-criteria group recommendation systems
IF 2.7 Zone 3 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-07 DOI: 10.1016/j.datak.2025.102553
Tugba Turkoglu Kaya
Aggregation strategies in group recommender systems often fall short in balancing diverse user preferences and ensuring fair satisfaction within the group. These limitations become more pronounced in single-criteria frameworks, where the multidimensional nature of user–item interactions is overlooked, thereby restricting the system’s capacity to capture subtle preference variations. While multi-criteria recommendation offers a promising solution by incorporating multiple evaluation dimensions, the adaptation of single-criteria aggregation mechanisms to a multi-criteria setting remains an open research question. To this end, this study develops new aggregation techniques and a top-n recommendation mechanism for a new multi-criteria group recommendation system. The new aggregation techniques, called weighted preference aggregation, preference without weighted aggregation, and weighted without preference vector aggregation, take into account user tendencies and the qualitative sequences of user evaluations. The newly developed top-n recommendation mechanism uses product characteristic structures to prepare a recommendation list according to group tendencies. In experiments on two datasets (Yahoo!Movies, TripAdvisor) with three group sizes (1, 5, 10%), each proposed method is compared with methods available in the literature. The results show that the proposed methods are highly successful.
Data & Knowledge Engineering, Volume 162, Article 102553. Publication date: 2026-01-07.
Citations: 0
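For orientation, the kind of group aggregation and top-n selection discussed in this abstract can be sketched with a generic weighted-average baseline. This is not one of the paper's proposed techniques, whose details the abstract does not give. It is a common textbook-style baseline, and the function names, data layout, and sample ratings below are hypothetical.

```python
def aggregate_group(ratings, weights=None):
    """Combine members' multi-criteria ratings into one group score per item.

    ratings: {user: {item: [score per criterion]}}
    weights: optional {user: weight}; equal weights are used by default.
    """
    users = list(ratings)
    if weights is None:
        weights = {user: 1.0 for user in users}
    # Only items rated by every group member are scored.
    common_items = set.intersection(*(set(ratings[user]) for user in users))
    total_weight = sum(weights[user] for user in users)
    scores = {}
    for item in common_items:
        weighted_sum = 0.0
        for user in users:
            criteria = ratings[user][item]
            # Collapse the criteria to their mean, then apply the user weight.
            weighted_sum += weights[user] * sum(criteria) / len(criteria)
        scores[item] = weighted_sum / total_weight
    return scores


def top_n(scores, n):
    """Return the n best items; ties are broken alphabetically."""
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]


# Hypothetical ratings on three criteria for two group members.
ratings = {
    "ann": {"hotel_a": [5, 4, 3], "hotel_b": [2, 2, 2], "hotel_c": [4, 4, 4]},
    "bob": {"hotel_a": [3, 3, 3], "hotel_b": [5, 5, 5], "hotel_c": [4, 4, 4]},
}
group_scores = aggregate_group(ratings)
recommended = top_n(group_scores, 2)
```

Passing per-user weights, for example `aggregate_group(ratings, {"ann": 3.0, "bob": 1.0})`, shifts the group score toward one member's preferences. That is the general lever a tendency-aware method would tune.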
From primes to paths: Enabling fast multi-relational graph analysis
IF 2.7 Zone 3 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-07 DOI: 10.1016/j.datak.2026.102554
Konstantinos Bougiatiotis , Georgios Paliouras
Multi-relational networks capture intricate relationships in data and have diverse applications across fields such as biomedical, financial, and social sciences. As networks derived from increasingly large datasets become more common, identifying efficient methods for representing and analyzing them becomes crucial. This work extends the Prime Adjacency Matrices (PAMs) framework, which employs prime numbers to represent distinct relations within a network uniquely. This enables a compact representation of a complete multi-relational graph using a single adjacency matrix, which, in turn, facilitates quick computation of multi-hop adjacency matrices. In this work, we enhance the framework by introducing a lossless algorithm for calculating the multi-hop matrices and propose the Bag of Paths (BoP) representation, a versatile feature extraction methodology for various graph analytics tasks, at the node, edge, and graph level. We demonstrate the efficiency of the framework across various tasks and datasets, showing that simple BoP-based models perform comparably to or better than commonly used neural models while improving speed by orders of magnitude.
Data & Knowledge Engineering, Volume 163, Article 102554. Publication date: 2026-01-07.
Citations: 0
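The prime-based encoding described in the "From primes to paths" abstract can be illustrated with a small sketch. It is written under stated assumptions and is not the authors' algorithm. The relation names, the prime assignment, and the rule of multiplying primes for parallel edges are hypothetical. The `bag_of_paths` helper is a plain enumeration of relation sequences, swapped in as one plausible reading of a Bag of Paths feature; it is not the paper's lossless multi-hop method.

```python
# Hypothetical relation names, each mapped to a distinct prime.
PRIMES = {"treats": 2, "causes": 3, "interacts": 5}


def build_pam(num_nodes, triples):
    """Build a single adjacency matrix whose nonzero entries are relation primes."""
    pam = [[0] * num_nodes for _ in range(num_nodes)]
    for head, relation, tail in triples:
        # Assumption: parallel edges are combined by multiplication, so the
        # relations on an edge could be recovered by factorisation.
        pam[head][tail] = (pam[head][tail] or 1) * PRIMES[relation]
    return pam


def matmul(a, b):
    """Plain square-matrix product on nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bag_of_paths(triples, start, hops):
    """Count the relation sequences of length `hops` that leave `start`."""
    outgoing = {}
    for head, relation, tail in triples:
        outgoing.setdefault(head, []).append((relation, tail))
    frontier = [((), start)]
    for _ in range(hops):
        frontier = [(path + (relation,), nxt)
                    for path, node in frontier
                    for relation, nxt in outgoing.get(node, [])]
    counts = {}
    for path, _ in frontier:
        counts[path] = counts.get(path, 0) + 1
    return counts


triples = [(0, "treats", 1), (1, "causes", 2),
           (0, "interacts", 3), (3, "treats", 2)]
pam = build_pam(4, triples)
two_hop = matmul(pam, pam)          # entry (0, 2) sums 2*3 and 5*2
bop = bag_of_paths(triples, start=0, hops=2)
```

In this toy graph the two-hop entry from node 0 to node 2 is 16, the sum of two path products (6 and 10). The sum alone no longer reveals which relation sequences produced it, whereas the explicit enumeration keeps the two sequences apart.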