Information Systems: Latest Articles

New compressed indices for multijoins on graph databases
IF 3.4; Zone 2, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2026-04-01; Epub Date: 2025-11-13; DOI: 10.1016/j.is.2025.102647
Diego Arroyuelo, Fabrizio Barisione, Antonio Fariña, Adrián Gómez-Brandón, Gonzalo Navarro
A recent surprising result in the implementation of worst-case-optimal (wco) multijoins in graph databases (specifically, basic graph patterns) is that they can be supported on graph representations that take even less space than a plain representation, and orders of magnitude less space than classical indices, while offering comparable performance. In this paper we uncover a wide set of new wco space–time tradeoffs: we (1) introduce new compact indices that handle multijoins in wco time, and (2) combine them with new query resolution strategies that offer better times in practice. As a result, we improve the average query times of current compact representations by a factor of up to 13 to produce the first 1000 results, and using twice their space, reduce their total average query time by a factor of 2. Our experiments suggest that there is more room for improvement in terms of generating better query plans for multijoins.
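As a rough illustration of the kind of multijoin the abstract refers to, the sketch below evaluates a triangle-shaped basic graph pattern (x→y, y→z, x→z) one variable at a time, intersecting candidate sets for the last variable. It uses plain Python dictionaries of sets rather than any compressed index, so it shows only the variable-at-a-time join style and not the paper's data structures; the function names `build_adjacency` and `triangles` are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each source node to the set of its out-neighbours."""
    out = defaultdict(set)
    for source, target in edges:
        out[source].add(target)
    # Return a plain dict so that lookups never insert new keys.
    return dict(out)


def triangles(edges):
    """Enumerate bindings (x, y, z) with edges x->y, y->z and x->z.

    Variables are bound one at a time; the candidates for z are the
    intersection of the out-neighbours of x and of y.
    """
    out = build_adjacency(edges)
    results = []
    for x, x_neighbours in out.items():
        for y in x_neighbours:
            y_neighbours = out.get(y, set())
            for z in x_neighbours & y_neighbours:
                results.append((x, y, z))
    return sorted(results)


edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(triangles(edges))
```

Binding one variable at a time and intersecting candidates avoids materialising the full pairwise join of two edge sets before the third condition is checked.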
Information Systems, Volume 137, Article 102647
Citations: 0
Enhancing next activity prediction in process mining with Retrieval-Augmented Generation
IF 3.4; Zone 2, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2026-04-01; Epub Date: 2025-11-03; DOI: 10.1016/j.is.2025.102642
Angelo Casciani, Mario Luca Bernardi, Marta Cimitile, Andrea Marrella
Next activity prediction is one of the main tasks of Predictive Process Monitoring (PPM), enabling organizations to forecast the execution of business processes and respond accordingly. Deep learning models are effective at prediction, but at the price of intensive training and feature engineering, which makes them less generalizable across domains. Large Language Models (LLMs) have recently been suggested as an alternative, but their capabilities in Process Mining tasks have yet to be extensively investigated. This work introduces a framework leveraging LLMs and Retrieval-Augmented Generation to enhance their capabilities for predicting next activities. By leveraging sequential information and data attributes from past execution traces, our framework enables LLMs to make more accurate predictions without additional training. We evaluate the approach on a wide range of event logs and compare it with state-of-the-art techniques. Findings show that our framework achieves competitive performance while being more adaptable across domains. Moreover, we assess early prediction capabilities, validate the significance of observed differences through statistical testing, and explore the impact of fine-tuning. Despite these advantages, we also report the framework’s limitations, mainly related to interleaving activity sensitivity and concept drifts. Our findings highlight the potential of retrieval-augmented LLMs in PPM while identifying the need for future research into handling evolving process behaviors and the development of standard benchmarks.
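The retrieval step described in the abstract can be pictured with a minimal sketch, under the assumption that past traces are lists of activity names and that similarity is measured with Python's standard `difflib.SequenceMatcher`. The helper names `retrieve_similar_prefixes` and `build_prompt`, the similarity measure, and the prompt wording are all hypothetical and are not taken from the paper; no LLM is called here.

```python
from difflib import SequenceMatcher


def retrieve_similar_prefixes(history, prefix, k=2):
    """Return the k (past_prefix, next_activity) pairs most similar to `prefix`."""
    candidates = []
    for trace in history:
        for i in range(1, len(trace)):
            past, next_activity = trace[:i], trace[i]
            score = SequenceMatcher(None, past, prefix).ratio()
            candidates.append((score, past, next_activity))
    # Sort by similarity only; ties keep their original order.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(past, nxt) for _, past, nxt in candidates[:k]]


def build_prompt(examples, prefix):
    """Assemble a plain-text prompt from retrieved examples (hypothetical format)."""
    lines = ["Past executions:"]
    for past, nxt in examples:
        lines.append(f"- {' -> '.join(past)} => next: {nxt}")
    lines.append(f"Current trace: {' -> '.join(prefix)}")
    lines.append("Predict the next activity.")
    return "\n".join(lines)


history = [
    ["register", "check", "approve", "pay"],
    ["register", "check", "reject"],
]
prefix = ["register", "check"]
examples = retrieve_similar_prefixes(history, prefix)
print(build_prompt(examples, prefix))
```

In a full pipeline the resulting prompt would be sent to a language model; here it is only printed, which keeps the sketch independent of any particular model or API.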
Information Systems, Volume 137, Article 102642
Citations: 0
Nine years later: Reflecting on our article
IF 3.4; Zone 2, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2026-04-01; Epub Date: 2025-11-04; DOI: 10.1016/j.is.2025.102644
Massimiliano de Leoni, Wil M.P. van der Aalst, Marcus Dees
This contribution revisits our article titled “A General Process Mining Framework for Correlating, Predicting, and Clustering Dynamic Behavior Based on Event Logs”, published in the Information Systems journal in 2016. It reflects on how the proposed general framework for process mining has grown in relevance with the rise of AI, emphasizing its value as an extensible approach to transforming event data into analytical and predictive insights. It also discusses how the framework's relevance and underlying message remain valid, including for emerging research directions such as prescriptive analytics and causal and/or object-centric process mining.
Information Systems, Volume 137, Article 102644
Citations: 0
SOLID-M: An ontology-aware quality framework for conceptual models discovered from event data
IF 3.4; Zone 2, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2026-04-01; Epub Date: 2025-11-04; DOI: 10.1016/j.is.2025.102641
Andrei Tour, Artem Polyvyanyy, Anna Kalenkova
In Process Mining (PM), “high-level” conceptual models of business processes, in the form of directly-follows graphs, Petri nets, and finite-state automata, are discovered from “low-level” event data recorded by information systems. The quality of the discovered models is usually assessed by measures that depend on assumptions made by discovery algorithms; for example, they often assume that sequences of activities recorded in the event data do not interfere. Models produced by recent discovery algorithms consider domain knowledge and relax these assumptions, making traditional PM measures less suitable for evaluating their quality. This paper proposes an ontology-aware framework, called SOLID-M, for analyzing the quality of conceptual models discovered from event data generated by systems. SOLID-M relies on domain knowledge and provides guidelines for introducing quality measures for models constructed by process discovery algorithms that go beyond the traditional PM assumptions. In addition, the paper describes an instantiation of the framework for assessing the quality of Multi-Agent System models discovered using Agent System Mining techniques, hence addressing a growing demand for data-driven analysis of business processes emerging in interactions of human and artificial intelligence agents.
Information Systems, Volume 137, Article 102641
Citations: 0
Local intrinsic dimensionality and the estimation of convergence order
IF 3.4; Zone 2, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2026-04-01; Epub Date: 2025-11-21; DOI: 10.1016/j.is.2025.102648
Michael E. Houle, Vincent Oria, Hamideh Sabaei
Fixed-point iteration (FPI) is a crucially important technique at the foundation of many scientific and engineering fields, such as numerical analysis, dynamical systems, optimization, and machine learning. In these domains, algorithmic efficiency and stability are often assessed using the notion of convergence order, a quantity whose estimation has typically involved line fitting in log–log space, or finding the limit of an associated function on differences of sequence values. In this paper, we establish a precise equivalence between the convergence order of a fixed-point update function and the local intrinsic dimensionality (LID) of that function once its fixed point is translated to the origin. Building on this insight, we propose a unified framework for re-purposing existing distributional estimators of LID to estimate the convergence order. Of the LID estimators considered, we show that two, the MLE (Hill) estimator and a Bayesian estimator, have practical and convenient closed-form expressions. We further investigate how these estimators of convergence order can be enhanced using Aitken’s Δ² method for accelerating convergence in slow scenarios, as well as a Bayesian smoothing layer for reducing variance when the number of samples is small. Empirically, we benchmark our LID-based estimators against classical sequence-based and curve-fitting methods in three experimental settings: root-finding, general iteration, and machine learning regression. Results indicate that our approaches frequently match or surpass the classical estimators in accuracy, while offering robust performance over a broader range of convergence scenarios.
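For readers unfamiliar with the classical baselines mentioned in the abstract, the following sketch shows two textbook ingredients: the sequence-based estimate of convergence order, q_k = log(e_{k+1}/e_k) / log(e_k/e_{k-1}), where e_k is the distance to the fixed point, and Aitken's Δ² transform. It does not implement the LID-based estimators of the paper; the function names are hypothetical, and the example assumes the fixed point is known in advance.

```python
import math


def fixed_point_iterates(g, x0, steps):
    """Return [x0, g(x0), g(g(x0)), ...] after `steps` applications of g."""
    xs = [x0]
    for _ in range(steps):
        xs.append(g(xs[-1]))
    return xs


def order_estimates(xs, fixed_point):
    """Classical estimates q_k = log(e_{k+1}/e_k) / log(e_k/e_{k-1})."""
    errs = [abs(x - fixed_point) for x in xs]
    estimates = []
    for k in range(1, len(errs) - 1):
        e_prev, e_cur, e_next = errs[k - 1], errs[k], errs[k + 1]
        # Stop once an error is zero or the errors stall: the ratio is undefined.
        if min(e_prev, e_cur, e_next) <= 0.0 or e_cur == e_prev:
            break
        estimates.append(math.log(e_next / e_cur) / math.log(e_cur / e_prev))
    return estimates


def aitken_delta2(xs):
    """Aitken's delta-squared transform: x_k - (dx_k)^2 / (d2x_k)."""
    accelerated = []
    for k in range(len(xs) - 2):
        d1 = xs[k + 1] - xs[k]
        d2 = xs[k + 2] - 2.0 * xs[k + 1] + xs[k]
        if d2 == 0.0:
            break
        accelerated.append(xs[k] - d1 * d1 / d2)
    return accelerated


# Newton's method for x^2 = 2: the order estimates approach 2.
newton = fixed_point_iterates(lambda x: x - (x * x - 2.0) / (2.0 * x), 1.0, 4)
print(order_estimates(newton, math.sqrt(2.0)))

# A linearly convergent iteration with fixed point 2, accelerated by Aitken.
linear = fixed_point_iterates(lambda x: 0.5 * x + 1.0, 0.0, 3)
print(aitken_delta2(linear))
```

The iteration count for Newton's method is kept small on purpose: once the error reaches machine precision, the logarithmic ratios stop being meaningful.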
Information Systems, Volume 137, Article 102648
Citations: 0
The many facets of fairness in recommender systems: Consumers, providers and items
IF 3.4; Zone 2, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2026-04-01; Epub Date: 2025-11-12; DOI: 10.1016/j.is.2025.102643
Reza Shafiloo, Maria Stratigi, Jaakko Peltonen, Thomas Olsson, Kostas Stefanidis
Autonomous decision-making systems, particularly recommender systems, have received increasing attention concerning fairness, i.e., whether all stakeholders affected by such a system are treated equally as a result of the recommendations. Existing approaches primarily focus on fairness between two stakeholders – consumers and providers or consumers and items – treating providers and items as the same entity. However, we argue for the treatment of providers and items as distinct stakeholders to offer more comprehensive models of fairness in recommender systems. To this end, we propose a fairness-aware recommender system, CIPFRS, designed to optimize fairness across all three key stakeholders: consumers, providers, and items. We examine consumer fairness with respect to users' level of interaction with the system; high- and low-activity users should be treated equally. Further, all providers should have an equal opportunity for their products to be recommended. Finally, we propose an approach to implement item fairness in each provider’s inventory. We report an extensive evaluation of the proposed solution on three datasets, demonstrating that considering all three stakeholders yields improved recommendations while minimizing bias.
Information Systems, Volume 137, Article 102643
Citations: 0
Unsupervised and semi-supervised clustering via density and distance-based label propagation and assignment
IF 3.4; Zone 2, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2026-02-01; Epub Date: 2025-10-28; DOI: 10.1016/j.is.2025.102639
Zhen Jiang, Bolin Niu, Jinxin Gua, Yuping Xing
Density-based clustering is capable of identifying clusters of arbitrary shapes without the need to predefine the number of clusters or their distributions. However, it suffers from varying density and parameter sensitivity. To tackle these challenges, we present the Density and Distance-Based Clustering (DDBC) algorithm, which performs clustering from the backbone to the foliage. Based on the “K_cutoff” neighborhoods of core points, DDBC constructs the cluster backbone through label propagation and subcluster aggregation. Subsequently, we construct cluster prototypes and leverage point-prototype distances to help assign points located outside the backbone. The proposed method effectively mitigates issues related to varying density. Furthermore, we propose a semi-supervised version of DDBC, termed SS-DDBC, which utilizes a small amount of labeled data to guide label propagation and subcluster aggregation. It provides a safe and adaptive approach to leverage class information for semi-supervised clustering. Moreover, we propose automated parameter optimization approaches for DDBC and SS-DDBC, thus addressing the issue of parameter sensitivity. In both unsupervised and semi-supervised settings, we conducted experimental comparisons of DDBC and SS-DDBC with ten state-of-the-art algorithms across a range of benchmark datasets. Both algorithms consistently outperform their competitors in terms of average performance and achieve superior results on the majority of datasets. These experimental results demonstrate the effectiveness of our proposed methods. The source code for our algorithms is available at https://github.com/nblnbl/DDBC.
Information Systems, Volume 136, Article 102639
Citations: 0
Predicting multidimensional cubes through intentional analytics
IF 3.4; Zone 2, Computer Science; Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2026-02-01; Epub Date: 2025-09-17; DOI: 10.1016/j.is.2025.102628
Matteo Francia, Stefano Rizzi, Matteo Golfarelli, Patrick Marcel
In an attempt to streamline exploratory data analysis of multidimensional cubes, the Intentional Analytics Model has been proposed as a way to unite OLAP and analytics by allowing users to indicate their analysis intentions and returning cubes enhanced with models. Five intention operators were envisioned to this end; in this work we focus on the predict operator, whose goal is to estimate the missing values of a cube measure starting from known values of the same measure or other measures using different regression models. Although prediction tasks such as forecasting and imputation are routine for analysts, the added value of our approach is (i) to encapsulate them in a declarative, concise, natural language-like syntax; (ii) to automate the selection of the best measures to be used and the computation of the models; and (iii) to automate the evaluation of the interest of the models computed. First, we propose a syntax and a semantics for predict and discuss how enhanced cubes are built by (i) predicting the missing values for a measure based on the available information via one or more models and (ii) highlighting the most interesting prediction. Then, we test the operator implementation, showing that its performance is in line with the interactivity requirement of OLAP sessions and that accurate predictions can be returned.
Information Systems, vol. 136, Article 102628 (Journal Article). DOI: 10.1016/j.is.2025.102628
Citations: 0
GPT-5 and open-weight large language models: Advances in reasoning, transparency, and control
IF 3.4 Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-02-01 Epub Date: 2025-09-18 DOI: 10.1016/j.is.2025.102620
Maikel Leon
The rapid evolution of Generative Pre-trained Transformers (GPTs) has revolutionized natural language processing, enabling models to generate coherent text, solve mathematical problems, write code, and even reason about complex tasks. This paper presents a scientific review of GPT-5, OpenAI’s latest flagship model, and examines its innovations in comparison to previous generations of GPT. We summarize the model’s architecture and features, including hierarchical routing, expanded context windows, and enhanced tool-use capabilities, and survey empirical evidence of improved performance on academic benchmarks. A dedicated section discusses the release of open-weight mixture-of-experts models (GPT-OSS), describing their technical design, licensing, and comparative performance. Our analysis synthesizes findings from recent literature on long-context evaluation, cognitive biases, medical summarization, and hallucination vulnerability, highlighting where GPT-5 advances the state of the art and where challenges remain. We conclude by discussing the implications of open-weight models for transparency and reproducibility and propose directions for future research on evaluation, safety, and agentic behavior.
Information Systems, vol. 136, Article 102620 (Journal Article). DOI: 10.1016/j.is.2025.102620
Citations: 0
Extended parameterized Burrows–Wheeler transform
IF 3.4 Zone 2 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-02-01 Epub Date: 2025-09-11 DOI: 10.1016/j.is.2025.102611
Eric M. Osterkamp , Dominik Köppl
The Burrows–Wheeler transform (BWT) lies at the heart of succinct and compressed full-text indexes for pattern matching queries. Notable variants are (a) the extended BWT (eBWT), capable of indexing multiple circular texts for pattern matching, and (b) the parameterized BWT (pBWT) for parameterized pattern matching. A natural extension is to combine the virtues of both variants into a new data structure, which we call the extended parameterized BWT (epBWT). We show that the epBWT supports parameterized pattern matching on multiple circular texts within the same complexities as the known solutions for the pBWT [Kim and Cho, IPL’21], for patterns not longer than the shortest indexed text. Additionally, we show how to compute the epBWT within the same complexities as [Iseri et al., ICALP’24], i.e., in compact space and quasilinear time. As an application, we extend the matching statistics problem to the parameterized pattern matching setting on circular texts.
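For readers unfamiliar with the transform these variants build on, here is a minimal sketch of the classical BWT computed from sorted rotations, together with a naive inverse. It covers only the plain, single-text BWT with an end-of-text sentinel, not the eBWT, pBWT, or epBWT from the abstract; the function names and the choice of `$` as sentinel are assumptions made for illustration, and the quadratic-space construction is for clarity rather than efficiency.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler transform of text via sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # All cyclic rotations of s, sorted lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Invert bwt() by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original string is the rotation that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("banana")))  # banana
```

The sentinel is what makes the plain transform invertible for a single linear text. The eBWT instead works on circular texts, where the rotation order has to be defined differently, so this sketch should not be read as an implementation of it.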
Information Systems, vol. 136, Article 102611 (Journal Article). DOI: 10.1016/j.is.2025.102611
Citations: 0