New compressed indices for multijoins on graph databases
Diego Arroyuelo, Fabrizio Barisione, Antonio Fariña, Adrián Gómez-Brandón, Gonzalo Navarro
Information Systems, vol. 137, Article 102647. Pub date: 2026-04-01 (Epub: 2025-11-13). DOI: 10.1016/j.is.2025.102647

A recent surprising result in the implementation of worst-case-optimal (wco) multijoins in graph databases (specifically, basic graph patterns) is that they can be supported on graph representations that take even less space than a plain representation, and orders of magnitude less space than classical indices, while offering comparable performance. In this paper we uncover a wide set of new wco space–time tradeoffs: we (1) introduce new compact indices that handle multijoins in wco time, and (2) combine them with new query resolution strategies that offer better times in practice. As a result, we improve the average time of current compact representations to produce the first 1000 results by a factor of up to 13 and, using twice their space, halve their total average query time. Our experiments suggest that there is further room for improvement in generating better query plans for multijoins.
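The wco guarantee hinges on resolving a multijoin one variable at a time, intersecting the candidate sets contributed by each pattern, rather than joining relations pairwise. A minimal sketch of that idea for the triangle pattern, using plain Python sets (illustrative only; the paper's compact indexes and the function name here are not from the article):

```python
from collections import defaultdict

def triangles(edges):
    """Enumerate triangles (a, b, c) with edges a->b, b->c, a->c by binding
    one variable at a time, as wco join algorithms do: candidates for the
    last variable come from intersecting the neighbor sets of a and b."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
    out = []
    for a in sorted(adj):
        for b in sorted(adj[a]):
            # c must be a neighbor of both a and b: set intersection
            for c in sorted(adj[a] & adj[b]):
                out.append((a, b, c))
    return out
```

For example, `triangles([(1, 2), (2, 3), (1, 3)])` yields the single triangle `(1, 2, 3)`. Real wco implementations (e.g., Leapfrog Triejoin) perform this intersection over sorted index iterators instead of materialized sets.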
Enhancing next activity prediction in process mining with Retrieval-Augmented Generation
Angelo Casciani, Mario Luca Bernardi, Marta Cimitile, Andrea Marrella
Information Systems, vol. 137, Article 102642. Pub date: 2026-04-01 (Epub: 2025-11-03). DOI: 10.1016/j.is.2025.102642

Next activity prediction is one of the main tasks of Predictive Process Monitoring (PPM), enabling organizations to forecast the execution of business processes and respond accordingly. Deep learning models are effective at such predictions, but at the price of intensive training and feature engineering, which makes them less generalizable across domains. Large Language Models (LLMs) have recently been suggested as an alternative, but their capabilities in Process Mining tasks have yet to be extensively investigated. This work introduces a framework leveraging LLMs and Retrieval-Augmented Generation to enhance their capabilities for predicting next activities. By leveraging sequential information and data attributes from past execution traces, our framework enables LLMs to make more accurate predictions without additional training. We evaluate the approach on a wide range of event logs and compare it with state-of-the-art techniques. Findings show that our framework achieves competitive performance while being more adaptable across domains. Moreover, we assess early prediction capabilities, validate the significance of observed differences through statistical testing, and explore the impact of fine-tuning. Despite these advantages, we also report the framework's limitations, mainly related to sensitivity to interleaved activities and concept drift. Our findings highlight the potential of retrieval-augmented LLMs in PPM while identifying the need for future research into handling evolving process behaviors and developing standard benchmarks.
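As a rough intuition for how retrieval can support prediction without training, the toy baseline below looks up the historical trace prefixes most similar to the running prefix and votes on the next activity. This is a hypothetical sketch to illustrate the retrieval step, not the LLM-based framework described in the abstract:

```python
from collections import Counter

def predict_next(prefix, log, k=3):
    """Toy retrieval baseline: score every historical prefix by the length
    of the activity suffix it shares with the running prefix, then take a
    majority vote over the next activities of the k best matches."""
    candidates = []
    for trace in log:
        for i in range(1, len(trace)):
            hist_prefix, nxt = trace[:i], trace[i]
            # similarity = length of the common suffix of the two prefixes
            sim = 0
            while (sim < min(len(prefix), len(hist_prefix))
                   and prefix[-1 - sim] == hist_prefix[-1 - sim]):
                sim += 1
            candidates.append((sim, nxt))
    candidates.sort(key=lambda t: -t[0])
    votes = Counter(nxt for _, nxt in candidates[:k])
    return votes.most_common(1)[0][0]
```

In the article's framework, the retrieved traces are instead injected into the LLM prompt together with data attributes, and the LLM performs the prediction.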
Nine years later: Reflecting on our article
Massimiliano de Leoni, Wil M.P. van der Aalst, Marcus Dees
Information Systems, vol. 137, Article 102644. Pub date: 2026-04-01 (Epub: 2025-11-04). DOI: 10.1016/j.is.2025.102644

This contribution revisits our article "A General Process Mining Framework for Correlating, Predicting, and Clustering Dynamic Behavior Based on Event Logs", published in the Information Systems journal in 2016. It reflects on how the proposed general framework for process mining has grown in relevance with the rise of AI, emphasizing its value as an extensible approach to transforming event data into analytical and predictive insights. It also discusses how the framework's relevance and underlying message remain valid, including for emerging research directions such as prescriptive analytics and causal and object-centric process mining.
SOLID-M: An ontology-aware quality framework for conceptual models discovered from event data
Andrei Tour, Artem Polyvyanyy, Anna Kalenkova
Information Systems, vol. 137, Article 102641. Pub date: 2026-04-01 (Epub: 2025-11-04). DOI: 10.1016/j.is.2025.102641

In Process Mining (PM), "high-level" conceptual models of business processes, in the form of directly-follows graphs, Petri nets, and finite-state automata, are discovered from "low-level" event data recorded by information systems. The quality of the discovered models is usually assessed by measures that depend on assumptions made by the discovery algorithms; for example, they often assume that sequences of activities recorded in the event data do not interfere. Models produced by recent discovery algorithms consider domain knowledge and relax these assumptions, making traditional PM measures less suitable for evaluating their quality. This paper proposes an ontology-aware framework, called SOLID-M, for analyzing the quality of conceptual models discovered from event data generated by systems. SOLID-M relies on domain knowledge and provides guidelines for introducing quality measures for models constructed by process discovery algorithms that go beyond the traditional PM assumptions. In addition, the paper describes an instantiation of the framework for assessing the quality of Multi-Agent System models discovered using Agent System Mining techniques, hence addressing a growing demand for data-driven analysis of business processes that emerge from interactions between human and artificial intelligence agents.
Local intrinsic dimensionality and the estimation of convergence order
Michael E. Houle, Vincent Oria, Hamideh Sabaei
Information Systems, vol. 137, Article 102648. Pub date: 2026-04-01 (Epub: 2025-11-21). DOI: 10.1016/j.is.2025.102648

Fixed-point iteration (FPI) is a crucially important technique at the foundation of many scientific and engineering fields, such as numerical analysis, dynamical systems, optimization, and machine learning. In these domains, algorithmic efficiency and stability are often assessed using the notion of convergence order, a quantity whose estimation has typically involved line fitting in log–log space, or finding the limit of an associated function on differences of sequence values. In this paper, we establish a precise equivalence between the convergence order of a fixed-point update function and the local intrinsic dimensionality (LID) of that function once its fixed point is translated to the origin. Building on this insight, we propose a unified framework for repurposing existing distributional estimators of LID to estimate the convergence order. Of the LID estimators considered, we show that two, the MLE (Hill) estimator and a Bayesian estimator, have practical and convenient closed-form expressions. We further investigate how these estimators of convergence order can be enhanced using Aitken's Δ² method for accelerating convergence in slow scenarios, as well as a Bayesian smoothing layer for reducing variance when the number of samples is small. Empirically, we benchmark our LID-based estimators against classical sequence-based and curve-fitting methods in three experimental settings: root-finding, general iteration, and machine learning regression. Results indicate that our approaches frequently match or surpass the classical estimators in accuracy, while offering robust performance over a broader range of convergence scenarios.
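For context, the classical sequence-based estimator that such LID-based methods are benchmarked against can be sketched in a few lines; this is the textbook log-ratio formula applied to the error sequence of a fixed-point iteration, not the LID estimators introduced in the article:

```python
import math

def newton_sqrt2(x0, steps):
    """Fixed-point iteration x <- (x + 2/x)/2, i.e. Newton's method for
    f(x) = x^2 - 2, which converges quadratically to sqrt(2)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(0.5 * (x + 2.0 / x))
    return xs

def convergence_order(xs, limit):
    """Classical sequence-based estimate of the convergence order p,
    using p ~ log(e_{n+1}/e_n) / log(e_n/e_{n-1}) on the last three errors."""
    e = [abs(x - limit) for x in xs]
    return math.log(e[-1] / e[-2]) / math.log(e[-2] / e[-3])
```

Running `convergence_order(newton_sqrt2(1.5, 3), math.sqrt(2))` returns a value close to 2, the known quadratic order of Newton's method. In floating point this estimator degrades once the errors reach machine precision, one of the robustness issues the LID-based approach is designed to address.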
The many facets of fairness in recommender systems: Consumers, providers and items
Reza Shafiloo, Maria Stratigi, Jaakko Peltonen, Thomas Olsson, Kostas Stefanidis
Information Systems, vol. 137, Article 102643. Pub date: 2026-04-01 (Epub: 2025-11-12). DOI: 10.1016/j.is.2025.102643

Autonomous decision-making systems, particularly recommender systems, have received increasing attention concerning fairness, i.e., whether all stakeholders affected by such a system are treated equally as a result of the recommendations. Existing approaches primarily focus on fairness between two stakeholders (consumers and providers, or consumers and items), treating providers and items as the same entity. However, we argue for treating providers and items as distinct stakeholders to obtain more comprehensive models of fairness in recommender systems. To this end, we propose a fairness-aware recommender system, CIPFRS, designed to optimize fairness across all three key stakeholders: consumers, providers, and items. We examine consumer fairness with respect to users' level of interaction with the system: high- and low-activity users should be treated equally. Further, all providers should have an equal opportunity for their products to be recommended. Finally, we propose an approach to enforce item fairness within each provider's inventory. We report an extensive evaluation of the proposed solution on three datasets, demonstrating that considering all three stakeholders yields improved recommendations while minimizing bias.
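Provider-side fairness of the kind described can be quantified, for instance, as each provider's share of position-discounted exposure over a batch of recommendation lists. The following is a generic diagnostic sketch under that common definition, not the CIPFRS optimization itself:

```python
import math
from collections import defaultdict

def provider_exposure(rec_lists, provider_of):
    """Share of position-discounted exposure each provider receives over a
    batch of ranked recommendation lists. Uses the standard 1/log2(rank+1)
    discount; equal shares indicate equal provider opportunity."""
    exposure = defaultdict(float)
    for recs in rec_lists:
        for rank, item in enumerate(recs, start=1):
            exposure[provider_of[item]] += 1.0 / math.log2(rank + 1)
    total = sum(exposure.values())
    return {p: v / total for p, v in exposure.items()}
```

For example, if two providers' items alternate between the top and second slot across two lists, both end up with a 0.5 exposure share, whereas always ranking one provider first skews the shares toward it.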
Unsupervised and semi-supervised clustering via density and distance-based label propagation and assignment
Zhen Jiang, Bolin Niu, Jinxin Gua, Yuping Xing
Information Systems, vol. 136, Article 102639. Pub date: 2026-02-01 (Epub: 2025-10-28). DOI: 10.1016/j.is.2025.102639

Density-based clustering can identify clusters of arbitrary shapes without the need to predefine the number of clusters or their distributions. However, it suffers from varying density and parameter sensitivity. To tackle these challenges, we present the Density and Distance-Based Clustering (DDBC) algorithm, which performs clustering from the backbone to the foliage. Based on the "K_cutoff" neighborhoods of core points, DDBC constructs the cluster backbone through label propagation and subcluster aggregation. Subsequently, we construct cluster prototypes and leverage point-prototype distances to help assign points located outside the backbone. The proposed method effectively mitigates issues related to varying density. Furthermore, we propose a semi-supervised version of DDBC, termed SS-DDBC, which uses a small amount of labeled data to guide label propagation and subcluster aggregation. It provides a safe and adaptive approach to leveraging class information for semi-supervised clustering. Moreover, we propose automated parameter optimization approaches for DDBC and SS-DDBC, thus addressing the issue of parameter sensitivity. In both unsupervised and semi-supervised settings, we conducted experimental comparisons of DDBC and SS-DDBC with ten state-of-the-art algorithms across a range of benchmark datasets. Both algorithms consistently outperform their competitors in terms of average performance and achieve superior results on the majority of datasets. These experimental results demonstrate the effectiveness of our proposed methods. The source code for our algorithms is available at https://github.com/nblnbl/DDBC.
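As a rough illustration of the density criterion underlying such backbone-first methods, the sketch below marks core points by the distance to their k-th nearest neighbor; the quantile threshold here is a generic stand-in, not DDBC's actual "K_cutoff" definition from the paper:

```python
import math

def core_points(points, k, quantile=0.5):
    """Mark as 'core' the points whose k-th nearest-neighbor distance falls
    below the given quantile of all such distances: dense points (small kNN
    radius) form the backbone, sparse outliers are left for later assignment."""
    kth = []
    for p in points:
        ds = sorted(math.dist(p, q) for q in points if q is not p)
        kth.append(ds[k - 1])  # distance to the k-th nearest neighbor
    cut = sorted(kth)[int(quantile * (len(kth) - 1))]
    return [p for p, d in zip(points, kth) if d <= cut]
```

On a tight cluster plus one far-away outlier, the four clustered points are identified as core while the outlier is excluded; in DDBC the remaining points would then be assigned via point-prototype distances.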
Predicting multidimensional cubes through intentional analytics
Matteo Francia, Stefano Rizzi, Matteo Golfarelli, Patrick Marcel
Information Systems, vol. 136, Article 102628. Pub date: 2026-02-01 (Epub: 2025-09-17). DOI: 10.1016/j.is.2025.102628

In an attempt to streamline exploratory data analysis of multidimensional cubes, the Intentional Analytics Model has been proposed as a way to unite OLAP and analytics by allowing users to state their analysis intentions and returning cubes enhanced with models. Five intention operators were envisioned to this end; in this work we focus on the predict operator, whose goal is to estimate the missing values of a cube measure starting from known values of the same measure or of other measures, using different regression models. Although prediction tasks such as forecasting and imputation are routine for analysts, the added value of our approach is (i) to encapsulate them in a declarative, concise, natural-language-like syntax; (ii) to automate the selection of the best measures to be used and the computation of the models; and (iii) to automate the evaluation of the interest of the computed models. First we propose a syntax and a semantics for predict and discuss how enhanced cubes are built by (i) predicting the missing values of a measure based on the available information via one or more models and (ii) highlighting the most interesting prediction. Then we test the operator's implementation, showing that its performance is in line with the interactivity requirements of OLAP sessions and that accurate predictions can be returned.
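Conceptually, the predict operator fills in the missing cells of one measure from the known values of another via regression. A toy stand-in using a single linear model (the article automates model and measure selection, which this sketch does not):

```python
import numpy as np

def impute_measure(known_x, known_y, missing_x):
    """Fit y ~ a*x + b on cube cells where both measures are known, then
    predict the target measure for cells where it is missing. A minimal
    stand-in for the regression models behind a predict-style operator."""
    a, b = np.polyfit(known_x, known_y, 1)  # least-squares line fit
    return a * np.array(missing_x) + b
```

For instance, with known pairs (1, 2), (2, 4), (3, 6) the fitted line is y = 2x, so a cell with x = 4 is imputed as y = 8.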
GPT-5 and open-weight large language models: Advances in reasoning, transparency, and control
Maikel Leon
Information Systems, vol. 136, Article 102620. Pub date: 2026-02-01 (Epub: 2025-09-18). DOI: 10.1016/j.is.2025.102620

The rapid evolution of Generative Pre-trained Transformers (GPTs) has revolutionized natural language processing, enabling models to generate coherent text, solve mathematical problems, write code, and even reason about complex tasks. This paper presents a scientific review of GPT-5, OpenAI's latest flagship model, and examines its innovations in comparison to previous generations of GPT. We summarize the model's architecture and features, including hierarchical routing, expanded context windows, and enhanced tool-use capabilities, and survey empirical evidence of improved performance on academic benchmarks. A dedicated section discusses the release of open-weight mixture-of-experts models (GPT-OSS), describing their technical design, licensing, and comparative performance. Our analysis synthesizes findings from recent literature on long-context evaluation, cognitive biases, medical summarization, and hallucination vulnerability, highlighting where GPT-5 advances the state of the art and where challenges remain. We conclude by discussing the implications of open-weight models for transparency and reproducibility, and propose directions for future research on evaluation, safety, and agentic behavior.
Extended parameterized Burrows–Wheeler transform
Eric M. Osterkamp, Dominik Köppl
Information Systems, vol. 136, Article 102611. Pub date: 2026-02-01 (Epub: 2025-09-11). DOI: 10.1016/j.is.2025.102611

The Burrows–Wheeler transform (BWT) lies at the heart of succinct and compressed full-text indexes for pattern matching queries. Notable variants are (a) the extended BWT (eBWT), capable of indexing multiple circular texts for pattern matching, and (b) the parameterized BWT (pBWT) for parameterized pattern matching. A natural extension combines the virtues of both variants into a new data structure, which we coin the extended parameterized BWT (epBWT). We show that the epBWT supports parameterized pattern matching on multiple circular texts within the same complexities as the known solutions for the pBWT [Kim and Cho, IPL'21], for patterns not longer than the shortest indexed text. Additionally, we show how to compute the epBWT within the same complexities as [Iseri et al., ICALP'24], i.e., in compact space and quasilinear time. As an application, we extend the matching statistics problem to the parameterized pattern matching setting on circular texts.
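For readers new to the underlying transform, the plain BWT of a single sentinel-terminated text can be built naively from sorted rotations; the eBWT, pBWT, and epBWT discussed above generalize this construction (production indexes compute it via suffix arrays rather than explicit rotations):

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations: sort all cyclic
    rotations of the text and concatenate their last characters. Quadratic
    in space/time, so suitable only for small demonstration inputs."""
    assert text.endswith("$"), "append a unique sentinel smaller than all symbols"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)
```

For example, `bwt("banana$")` returns `"annb$aa"`, which groups equal characters together and is what makes the transform both compressible and indexable.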