Measuring the decentralisation of DeFi development: An empirical analysis of contributor distribution in Lido
Giuseppe Destefanis, Jiahua Xu, Silvia Bartolucci
Pub Date: 2026-01-29 | DOI: 10.1016/j.is.2026.102695 | Information Systems, Vol. 139, Article 102695
Decentralised finance (DeFi) protocols often claim to implement decentralised governance via mechanisms such as decentralised autonomous organisations (DAOs), yet the structure of their development processes is rarely examined in detail. This study presents an in-depth case analysis of the development activity distribution in Lido, a prominent DeFi liquid staking protocol. We analyse 6741 human-generated GitHub actions recorded from September 2020 to February 2025. Using standard inequality metrics (Gini coefficient and Herfindahl–Hirschman Index) alongside contributors’ interaction networks and core–periphery modelling, we find that development activity is highly concentrated. Overall, the weighted Gini coefficient reaches 0.82, and the most active contributor alone accounts for 24% of the total activity. Despite an even split between core and peripheral contributors, the core group accounts for 98.1% of all weighted development actions. The temporal analysis shows an increase in concentration over time, with the Gini coefficient rising from 0.686 in the bootstrap phase to 0.817 in the maturity phase. The contributors’ interaction network analysis reveals a hub-and-spoke structure with high centralisation in communication flows. Although this is a case study of a single protocol, Lido represents a critical test of decentralisation claims given its prominence, maturity, and DAO governance structure. These findings demonstrate that open-source DeFi development can exhibit highly concentrated control patterns despite decentralised governance mechanisms, revealing a persistent gap between governance and operational decentralisation.
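The two inequality metrics named in the abstract are standard and easy to reproduce. The sketch below (a minimal illustration with made-up activity counts, not the paper’s Lido data) computes the Gini coefficient and the Herfindahl–Hirschman Index over per-contributor action counts.

```python
# Gini coefficient and HHI over per-contributor activity counts.
# The counts below are hypothetical, not the paper's Lido data.

def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    # Standard closed form: G = (2 * sum_i i*x_i) / (n * total) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def hhi(counts):
    """Herfindahl-Hirschman Index: sum of squared activity shares (1/n ... 1)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

actions = [1620, 410, 300, 95, 40, 22, 10, 5, 3, 1]  # hypothetical contributors
print(f"Gini: {gini(actions):.3f}")
print(f"HHI:  {hhi(actions):.3f}")
```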
{"title":"Measuring the decentralisation of DeFi development: An empirical analysis of contributor distribution in Lido","authors":"Giuseppe Destefanis , Jiahua Xu , Silvia Bartolucci","doi":"10.1016/j.is.2026.102695","DOIUrl":"10.1016/j.is.2026.102695","url":null,"abstract":"<div><div>Decentralised finance (DeFi) protocols often claim to implement decentralised governance via mechanisms such as decentralised autonomous organisations (DAOs), yet the structure of their development processes is rarely examined in detail. This study presents an in-depth case analysis of the development activity distribution in Lido, a prominent DeFi liquid staking protocol. We analyse 6741 human-generated GitHub actions recorded from September 2020 to February 2025. Using standard inequality metrics – Gini coefficient and Herfindahl–Hirschman Index – alongside contributors’ interaction network and core–periphery modelling, we find that development activity is highly concentrated. Overall, the weighted Gini coefficient reaches 0.82 and the most active contributor alone accounts for 24% of the total activity. Despite an even split between core and peripheral contributors, the core group accounts for 98.1% of all weighted development actions. The temporal analysis shows an increase in concentration over time, with the Gini coefficient rising from 0.686 in the bootstrap phase to 0.817 in the maturity phase. The contributors’ interaction network analysis reveals a hub-and-spoke structure with high centralisation in communication flows. While a case study of a single protocol, Lido represents a critical test of decentralisation claims given its prominence, maturity, and DAO governance structure. These findings demonstrate that open-source DeFi development can exhibit highly concentrated control patterns despite decentralised governance mechanisms, revealing a persistent gap between governance and operational decentralisation.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"139 ","pages":"Article 102695"},"PeriodicalIF":3.4,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146081680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated decision-making for dynamic task assignment at scale
Riccardo Lo Bianco, Willem van Jaarsveld, Jeroen Middelhuis, Luca Begnardi, Remco Dijkman
Pub Date: 2026-01-22 | DOI: 10.1016/j.is.2026.102694 | Information Systems, Vol. 138, Article 102694
The Dynamic Task Assignment Problem (DTAP) concerns matching resources to tasks in real time while minimizing objectives such as resource cost or task cycle time. In this work, we consider a DTAP variant where every task is a case composed of a stochastic sequence of activities. In this variant, the DTAP involves deciding which employee to assign to which activity so that requests are processed as quickly as possible. In recent years, Deep Reinforcement Learning (DRL) has emerged as a promising tool for tackling this DTAP variant, but most research is limited to solving small-scale, synthetic problems, neglecting the challenges posed by real-world use cases. To bridge this gap, this work proposes a DRL-based Decision Support System (DSS) for real-world-scale DTAPs. To this end, we introduce a DRL agent with two novel elements: a graph structure for observations and actions that can effectively represent any DTAP, and a reward function that is provably equivalent to the objective of minimizing the average cycle time of tasks. The combination of these two novelties allows the agent to learn effective and generalizable assignment policies for real-world-scale DTAPs. The proposed DSS is evaluated on five DTAP instances whose parameters are extracted from real-world logs through process mining. The experimental evaluation shows that the proposed DRL agent matches or outperforms the best baseline on all DTAP instances and generalizes across different time horizons and instances.
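The claim of a reward provably equivalent to minimizing average cycle time has a classical backing: by Little’s law, the time-averaged number of open cases is proportional to the average cycle time. The sketch below shows one standard construction built on that fact; it is our illustrative assumption, not necessarily the authors’ exact reward function.

```python
# Hedged sketch: penalize work-in-progress at every decision epoch.
# By Little's law, minimizing the time-average number of open cases
# also minimizes average cycle time. Illustrative construction only,
# not necessarily the paper's exact reward.

def wip_reward(open_cases: set[str], dt: float) -> float:
    """Reward = negative number of in-flight cases, weighted by elapsed time dt."""
    return -len(open_cases) * dt

# Toy usage: three cases in flight for 2.0 time units, then two for 1.5.
print(wip_reward({"case-1", "case-2", "case-3"}, dt=2.0))  # -6.0
print(wip_reward({"case-1", "case-2"}, dt=1.5))            # -3.0
```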
{"title":"Automated decision-making for dynamic task assignment at scale","authors":"Riccardo Lo Bianco , Willem van Jaarsveld , Jeroen Middelhuis , Luca Begnardi , Remco Dijkman","doi":"10.1016/j.is.2026.102694","DOIUrl":"10.1016/j.is.2026.102694","url":null,"abstract":"<div><div>The Dynamic Task Assignment Problem (DTAP) concerns matching resources to tasks in real time while minimizing some objectives, like resource costs or task cycle time. In this work, we consider a DTAP variant where every task is a case composed of a stochastic sequence of activities. The DTAP, in this case, involves the decision of which employee to assign to which activity to process requests as quickly as possible. In recent years, Deep Reinforcement Learning (DRL) has emerged as a promising tool for tackling this DTAP variant, but most research is limited to solving small-scale, synthetic problems, neglecting the challenges posed by real-world use cases. To bridge this gap, this work proposes a DRL-based Decision Support System (DSS) for real-world scale DTAPs. To this end, we introduce a DRL agent with two novel elements: a graph structure for observations and actions that can effectively represent any DTAP and a reward function that is provably equivalent to the objective of minimizing the average cycle time of tasks. The combination of these two novelties allows the agent to learn effective and generalizable assignment policies for real-world scale DTAPs. The proposed DSS is evaluated on five DTAP instances whose parameters are extracted from real-world logs through process mining. The experimental evaluation shows how the proposed DRL agent matches or outperforms the best baseline in all DTAP instances and generalizes on different time horizons and across instances.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"138 ","pages":"Article 102694"},"PeriodicalIF":3.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146023096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A decade of systems for human data interaction
Eugene Wu, Yiru Chen, Haneen Mohammed, Zezhou Huang
Pub Date: 2026-01-21 | DOI: 10.1016/j.is.2026.102689 | Information Systems, Vol. 138, Article 102689
Human–data interaction (HDI) presents fundamentally different challenges from traditional data management. HDI systems must meet latency, correctness, and consistency needs that stem from usability rather than query semantics; failing to meet these expectations breaks the user experience. Moreover, interfaces and systems are tightly coupled; neither can easily be optimized in isolation, and effective solutions demand their co-design. This dependence also presents a research opportunity: rather than adapt systems to interface demands, systems innovations and database theory can also inspire new interaction and visualization designs. We survey a decade of our lab’s work that embraces this coupling and argue that HDI systems are the foundation for reliable, interactive, AI-driven applications.
{"title":"A decade of systems for human data interaction","authors":"Eugene Wu , Yiru Chen , Haneen Mohammed , Zezhou Huang","doi":"10.1016/j.is.2026.102689","DOIUrl":"10.1016/j.is.2026.102689","url":null,"abstract":"<div><div>Human–data interaction (HDI) presents fundamentally different challenges from traditional data management. HDI systems must meet latency, correctness, and consistency needs that stem from usability rather than query semantics; failing to meet these expectations breaks the user experience. Moreover, interfaces and systems are tightly coupled; neither can easily be optimized in isolation, and effective solutions demand their co-design. This dependence also presents a research opportunity: rather than adapt systems to interface demands, systems innovations and database theory can also inspire new interaction and visualization designs. We survey a decade of our lab’s work that embraces this coupling and argue that HDI systems are the foundation for reliable, interactive, AI-driven applications.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"138 ","pages":"Article 102689"},"PeriodicalIF":3.4,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146023098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient data structures for fast and low-cost first-order logic rule mining
Ruoyu Wang, Raymond Wong, Daniel Sun
Pub Date: 2026-01-21 | DOI: 10.1016/j.is.2026.102690 | Information Systems, Vol. 139, Article 102690
Logic rule mining discovers association patterns in the form of logic rules from structured data. Logic rules are widely applied in information systems to assist decisions in an interpretable way. However, state-of-the-art systems require substantial computational resources, as most of them optimize rule mining from the perspectives of algorithms and architecture while overlooking data efficiency. Although some state-of-the-art systems implement customized data structures to improve mining speed, the space overhead of these data structures is unaffordable when processing large-scale knowledge bases. Therefore, in this article, we propose data structures that improve data efficiency and accelerate logic rule mining. Our techniques implicitly represent the Cartesian product of variable substitutions in logic rules and build compact indices for a logic entailment cache. Furthermore, we create a pool and a lookup table for the cache so that cache components are not repeatedly created. The evaluation results show that our techniques reduce memory usage by over 95% and accelerate mining procedures by about 20× on average. Most importantly, mining on large-scale knowledge bases becomes practical on ordinary hardware, where a single thread and 20 GB of memory suffice.
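The core idea of representing the Cartesian product of variable substitutions implicitly, rather than materializing it, can be shown in a few lines. The class below is a hypothetical sketch of ours, not the paper’s data structure: it stores only per-variable candidate sets, yet answers cardinality and membership queries that would otherwise require enumerating every combination.

```python
from itertools import product

class ImplicitCartesianProduct:
    """Sketch of an implicit Cartesian product of substitution sets.

    Stores one set of candidate values per variable instead of
    materializing all combinations (hypothetical illustration,
    not the paper's data structure).
    """

    def __init__(self, factors: dict[str, set]):
        self.factors = factors

    def __len__(self):
        # Cardinality without enumeration: product of factor sizes.
        n = 1
        for values in self.factors.values():
            n *= len(values)
        return n

    def __contains__(self, substitution: dict) -> bool:
        # Membership check per factor: O(#variables), not O(product size).
        return all(substitution.get(var) in values
                   for var, values in self.factors.items())

    def enumerate(self):
        # Materialize combinations only on demand, e.g. for final output.
        names = list(self.factors)
        for combo in product(*(self.factors[v] for v in names)):
            yield dict(zip(names, combo))

subs = ImplicitCartesianProduct({"X": {"alice", "bob"}, "Y": {"p1", "p2", "p3"}})
print(len(subs))                          # 6, computed from factor sizes
print({"X": "alice", "Y": "p2"} in subs)  # True
```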
{"title":"Efficient data structures for fast and low-cost first-order logic rule mining","authors":"Ruoyu Wang , Raymond Wong , Daniel Sun","doi":"10.1016/j.is.2026.102690","DOIUrl":"10.1016/j.is.2026.102690","url":null,"abstract":"<div><div>Logic rule mining discovers association patterns in the form of logic rules from structured data. Logic rules are widely applied in information systems to assist decisions in an interpretable way. However, too many computational resources are required in state-of-the-art systems, as most of these systems optimize rule mining algorithms from the perspectives of algorithms and architecture, while data efficiency has been overlooked. Although some start-of-the-art systems implement customized data structures to improve mining speed, the space overhead of the data structures is unaffordable when processing large-scale knowledge bases. Therefore, in this article, we propose data structures to improve data efficiency and accelerate logic rule mining. Our techniques implicitly represent the Cartesian product of variable substitutions in logic rules and build compact indices for a logic entailment cache. Furthermore, we create a pool and a lookup table for the cache so that cache components will not be repeatedly created. The evaluation results show that over 95% of memory can be reduced by our techniques, and mining procedures have been accelerated by about 20x on average. Most importantly, mining on large-scale knowledge bases is practical on normal hardware where only one thread and 20GB of memory are sufficient even for large-scale knowledge bases.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"139 ","pages":"Article 102690"},"PeriodicalIF":3.4,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MDU-Net: Multi-resolution learning and differential clustering fusion for multivariate electricity time series forecasting
Yongming Guan, Chengdong Zheng, Yuliang Shi, Gang Wang, Linfeng Wu, Zhiyong Chen, Hui Li
Pub Date: 2026-01-19 | DOI: 10.1016/j.is.2026.102693 | Information Systems, Vol. 138, Article 102693
Artificial intelligence (AI) has demonstrated transformative potential in diverse fields such as healthcare, drug discovery, and natural language processing by enabling advanced pattern recognition and predictive modeling of complex data. In power systems, multivariate electricity time series forecasting tasks involving power load, electricity prices, and renewable energy are crucial for grid security and economic dispatch. Contemporary forecasting approaches primarily focus on two aspects: modeling multi-scale periodic characteristics within sequences and capturing complex collaborative dependencies among variables. However, existing techniques often fail to simultaneously disentangle multi-scale features and model the dynamically heterogeneous dependencies between variables. To overcome these limitations, this paper proposes MDU-Net, a novel forecasting framework comprising two core modules: a Multi-resolution hierarchical Union learning (MRU) module and a Differential Channel Clustering Fusion (DCCF) module. The MRU module constructs multi-granularity temporal representations through downsampling and achieves effective cross-scale feature fusion by integrating channel-independent operations with seasonal-trend decomposition. The DCCF module adopts first- and second-order derivative approximations to generate soft clustering mask matrices, adaptively capturing asymmetric collaborative dependencies among variables over time. Experimental results on multiple public datasets (ETT, Electricity) demonstrate that MDU-Net significantly outperforms state-of-the-art baselines in multivariate electricity time series prediction: it achieves 2.7% and 17.1% relative MSE reductions compared to TimeMixer and PatchTST, respectively, with 1.4% and 14.4% lower MAE. Notably, MDU-Net maintains strong generalization capabilities and computational efficiency, and also shows promising performance in cross-domain applications such as traffic forecasting.
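Reading the DCCF module’s “first- and second-order derivative approximations” as discrete differences suggests the loose sketch below: difference features per channel feed a soft channel-similarity mask. The shapes, cosine similarity, and softmax here are our illustrative assumptions, not the published module.

```python
import numpy as np

# Loose sketch (not the paper's DCCF module): approximate first- and
# second-order derivatives of each channel by discrete differences,
# then build a soft channel-clustering mask from derivative features.

def soft_channel_mask(x: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """x: (channels, time). Returns a (channels, channels) row-stochastic mask."""
    d1 = np.diff(x, n=1, axis=1)                     # first-order difference
    d2 = np.diff(x, n=2, axis=1)                     # second-order difference
    feats = np.concatenate([d1[:, 1:], d2], axis=1)  # align lengths
    # Cosine similarity between channel derivative profiles.
    norm = np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8
    sim = (feats / norm) @ (feats / norm).T
    # Row-wise softmax => soft clustering mask over channels.
    e = np.exp(sim / temperature)
    return e / e.sum(axis=1, keepdims=True)

x = np.cumsum(np.random.default_rng(0).normal(size=(4, 96)), axis=1)
mask = soft_channel_mask(x)
print(mask.shape, mask.sum(axis=1))  # (4, 4), each row sums to 1
```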
{"title":"MDU-Net: Multi-resolution learning and differential clustering fusion for multivariate electricity time series forecasting","authors":"Yongming Guan , Chengdong Zheng , Yuliang Shi , Gang Wang , Linfeng Wu , Zhiyong Chen , Hui Li","doi":"10.1016/j.is.2026.102693","DOIUrl":"10.1016/j.is.2026.102693","url":null,"abstract":"<div><div>Artificial intelligence (AI) has demonstrated transformative potential in diverse fields such as healthcare, drug discovery, and natural language processing by enabling advanced pattern recognition and predictive modeling of complex data. Particularly in the power system, where it involves areas such as power load, electricity price, and renewable energy, the application of AI technology to enhance the multivariate electricity time series forecasting tasks is crucial for grid security and economic dispatch. In power systems, multivariate electricity time series forecasting tasks involving power load, electricity prices, and renewable energy are crucial for grid security and economic dispatch. Contemporary forecasting approaches primarily focus on two aspects: modeling multi-scale periodic characteristics within sequences and capturing complex collaborative dependencies among variables. However, existing techniques often fail to simultaneously disentangle multi-scale features and model the dynamically heterogeneous dependencies between variables. To overcome these limitations, this paper proposes MDU-Net, a novel forecasting framework. The framework comprises two core modules: Multi-resolution hierarchical Union learning (MRU) module and Differential Channel Clustering Fusion (DCCF) Module. The MRU module constructs multi-granularity temporal representations through downsampling and achieves effective cross-scale feature fusion by integrating channel-independent operations with seasonal-trend decomposition. The DCCF module adopts first- and second-order derivative approximations to generate soft clustering mask matrices, adaptively capturing asymmetric collaborative dependencies among different variables over time. Experimental results on multiple public datasets (ETT, Electricity) demonstrate that MDU-Net significantly outperforms state-of-the-art baselines in multivariate electricity time series prediction. it achieves 2.7% and 17.1% relative MSE reductions compared to TimeMixer and PatchTST, respectively, with 1.4% and 14.4% lower MAE. Notably, MDU-Net maintains strong generalization capabilities and computational efficiency. The framework also shows promising performance in cross-domain applications such as traffic forecasting.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"138 ","pages":"Article 102693"},"PeriodicalIF":3.4,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146023095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Generalized CALM Theorem for Non-Deterministic Computation in Asynchronous Distributed Systems
Tim Baccaert, Bas Ketsman
Pub Date: 2026-01-16 | DOI: 10.1016/j.is.2026.102691 | Information Systems, Vol. 138, Article 102691
In most asynchronous distributed systems, consistency is achieved through coordination protocols such as Paxos, Raft, and 2PC. In many settings, such protocols are too slow, too difficult to implement, or practically infeasible. The CALM theorem, initially conjectured by Hellerstein, is one of the first results characterizing precisely which problems do not require such a coordination protocol. It states that a problem has a consistent, coordination-free distributed implementation if, and only if, the problem is monotone. This was proven for deterministic problems (i.e., queries) and extends slightly beyond monotone queries for systems in which nodes can consult the data partitioning strategy.
In this work, we generalize the CALM Theorem to work for non-deterministic problems such as leader election. Furthermore, we make the theorem applicable to a wider range of distributed systems. The prior variants of the theorem have only-if directions requiring that systems may only access their identifier in the network, the identifiers of other nodes, and the data partitioning strategy. Our generalization allows us to model systems with arbitrary shared information between the nodes (e.g., network topology, leader nodes, …). It additionally allows us to create a coordination spectrum that classifies how much coordination a problem requires based on how much shared information is needed to compute it. Lastly, we apply this generalized theorem to show that the classes of polynomial time problems and coordination-free problems are not equal.
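The monotonicity criterion at the heart of the theorem can be made concrete: a monotone query’s output over any subset of the input remains valid as facts arrive, so nodes can emit answers without coordinating. The toy example below (our illustration, not from the paper) contrasts a monotone join with a non-monotone negation query whose early answers must be retracted.

```python
# Toy illustration of the CALM criterion (ours, not from the paper):
# a monotone query's output only grows as new facts arrive, so early
# answers never need retraction; a query with negation is not monotone.

def monotone_join(edges):
    """Two-hop reachability: a monotone select-project-join query."""
    return {(a, c) for (a, b1) in edges for (b2, c) in edges if b1 == b2}

def non_monotone(nodes, edges):
    """Nodes with no outgoing edge: uses negation, hence not monotone."""
    return {n for n in nodes if not any(a == n for (a, _) in edges)}

partial = {("a", "b")}
full = {("a", "b"), ("b", "c")}

# Monotone: output over the partial input is a subset of the final output.
assert monotone_join(partial) <= monotone_join(full)

# Non-monotone: the early answer "b has no outgoing edge" is retracted
# once the fact ("b", "c") arrives -- this is what forces coordination.
print(non_monotone({"a", "b", "c"}, partial))  # {'b', 'c'}
print(non_monotone({"a", "b", "c"}, full))     # {'c'}
```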
{"title":"A Generalized CALM Theorem for Non-Deterministic Computation in Asynchronous Distributed Systems","authors":"Tim Baccaert, Bas Ketsman","doi":"10.1016/j.is.2026.102691","DOIUrl":"10.1016/j.is.2026.102691","url":null,"abstract":"<div><div>In most asynchronous distributed systems, consistency is achieved by use of coordination protocols such as Paxos, Raft, and 2PC. In many settings such protocols are too slow, too difficult to implement, or practically infeasible. The CALM theorem, initially conjectured by Hellerstein, is one of the first results characterizing precisely which problems do not require such a coordination protocol. It states that a problem has a consistent, coordination-free distributed implementation if, and only if, the problem is monotone. This was proven for deterministic problems (i.e., queries) and extends slightly beyond monotone queries for systems in which nodes can consult the data partitioning strategy.</div><div>In this work, we generalize the CALM Theorem to work for non-deterministic problems such as leader election. Furthermore, we make the theorem applicable to a wider range of distributed systems. The prior variants of the theorem have only-if directions requiring that systems may only access their identifier in the network, the identifiers of other nodes, and the data partitioning strategy. Our generalization allows us to model systems with arbitrary shared information between the nodes (e.g., network topology, leader nodes, …). It additionally allows us to create a coordination spectrum that classifies how much coordination a problem requires based on how much shared information is needed to compute it. Lastly, we apply this generalized theorem to show that the classes of polynomial time problems and coordination-free problems are not equal.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"138 ","pages":"Article 102691"},"PeriodicalIF":3.4,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146023092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DynaHash: An efficient blocking structure for streaming record linkage
Dimitrios Karapiperis, Christos Tjortjis, Vassilios S. Verykios
Pub Date: 2026-01-16 | DOI: 10.1016/j.is.2026.102692 | Information Systems, Vol. 138, Article 102692
Record linkage holds a crucial position in data management and analysis by identifying and merging records from disparate data sets that pertain to the same real-world entity. As data volumes grow, the intricacies of record linkage amplify, presenting challenges such as potential redundancies and computational complexity. This paper introduces DynaHash, a novel randomized record linkage mechanism that utilizes (a) the MinHash technique to generate compact representations of blocking keys and (b) Hamming Locality-Sensitive Hashing (LSH) to construct the blocking structure from these vectors. By employing these methods, DynaHash offers theoretical accuracy guarantees and, with appropriate parameter tuning, achieves sublinear runtime complexity. It comprises two key components: a persistent storage system that permanently stores the blocking structure to ensure complete results, and an in-memory component that generates very fast partial results by summarizing the persisted blocking structure. Additionally, DynaHash leverages Multi-Probe matching, which scans multiple neighboring blocks in terms of their Hamming distances in order to find matches. Our theoretical work derives a reduction factor in the space requirements, dependent on the Hamming threshold, compared with the baseline LSH. Our experimental evaluation against three state-of-the-art methods on six real-world data sets demonstrates DynaHash’s exceptional recall rates and query times, which are at least 2× faster than those of its competitors and do not depend on the size of the underlying data sets.
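The two building blocks named in the abstract (MinHash signatures over blocking keys, then Hamming LSH over the resulting vectors) compose roughly as sketched below; the hash function, signature length, binarization, and sampled coordinates are our illustrative choices, not DynaHash’s parameters.

```python
import random
import zlib

# Illustrative composition: MinHash signatures over a record's
# blocking-key tokens, then Hamming LSH via bit sampling. All
# parameters here are arbitrary choices, not DynaHash's.

SIG_LEN = 32
SEEDS = range(SIG_LEN)

def minhash(tokens: set[str]) -> list[int]:
    """One minimum per seeded hash => compact signature of the token set."""
    return [min(zlib.crc32(f"{seed}:{t}".encode()) for t in tokens)
            for seed in SEEDS]

def to_bits(signature: list[int]) -> list[int]:
    """Binarize the signature by keeping the lowest bit of each component."""
    return [v & 1 for v in signature]

def lsh_key(bits: list[int], coords: list[int]) -> tuple:
    """Bucket key = values at a few sampled coordinates (Hamming LSH)."""
    return tuple(bits[i] for i in coords)

coords = random.Random(42).sample(range(SIG_LEN), k=8)
a = to_bits(minhash({"john", "smith", "1985"}))
b = to_bits(minhash({"jon", "smith", "1985"}))

# Similar records yield small Hamming distance, so their sampled keys
# collide with high probability; multi-probe additionally scans buckets
# within a small Hamming radius of the key.
print(sum(x != y for x, y in zip(a, b)), "of", SIG_LEN, "bits differ")
print(lsh_key(a, coords), lsh_key(b, coords))
```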
{"title":"DynaHash: An efficient blocking structure for streaming record linkage","authors":"Dimitrios Karapiperis , Christos Tjortjis , Vassilios S. Verykios","doi":"10.1016/j.is.2026.102692","DOIUrl":"10.1016/j.is.2026.102692","url":null,"abstract":"<div><div>Record linkage holds a crucial position in data management and analysis by identifying and merging records from disparate data sets that pertain to the same real-world entity. As data volumes grow, the intricacies of record linkage amplify, presenting challenges, such as potential redundancies and computational complexities. This paper introduces DynaHash, a novel randomized record linkage mechanism that utilizes (a) the MinHash technique to generate compact representations of blocking keys and (b) Hamming Locality-Sensitive Hashing (LSH) to construct the blocking structure from these vectors. By employing these methods, DynaHash offers theoretical guarantees of accuracy and achieves sublinear runtime complexities, with appropriate parameter tuning. It comprises two key components: a persistent storage system for permanently storing the blocking structure to ensure complete results, and an in-memory component for generating very fast partial results by summarizing the persisted blocking structure. Additionally, DynaHash leverages Multi-Probe matching to scan multiple neighboring blocks, in terms of their Hamming distances, in order to find matches. Our theoretical work derives a decrease factor in the space requirements, which depends on the Hamming threshold, compared with the baseline LSH. Our experimental evaluation against three state-of-the-art methods on six real-world data sets demonstrates DynaHash’s exceptional recall rates and query times, which are at least <span><math><mrow><mn>2</mn><mo>×</mo></mrow></math></span> faster than its competitors and do not depend on the size of the underlying data sets.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"138 ","pages":"Article 102692"},"PeriodicalIF":3.4,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145977428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Example-driven semantic-similarity-aware query intent discovery: Empowering users to cross the SQL barrier through query by example
Anna Fariha, Lucy Cousins, Narges Mahyar, Alexandra Meliou
Pub Date: 2026-01-12 | DOI: 10.1016/j.is.2026.102687 | Information Systems, Vol. 138, Article 102687
Traditional relational data interfaces require precise structured queries over potentially complex schemas. These rigid data retrieval mechanisms pose hurdles for nonexpert users, who typically lack programming language expertise and are unfamiliar with the details of the schema. Existing tools assist in formulating queries through keyword search, query recommendation, and query auto-completion, but still require some technical expertise. An alternative method for accessing data is query by example (QBE), where users express their data exploration intent simply by providing examples of their intended data and the system infers the intended query. However, existing QBE approaches focus on the structural similarity of the examples and ignore the richer context present in the data. As a result, they typically produce queries that are too general, and fail to capture the user’s intent effectively. In this article, we present SQuID, a system that performs semantic-similarity-aware query intent discovery from user-provided example tuples.
Our work makes the following contributions: (1) We design SQuID: an end-to-end system that automatically formulates select-project-join queries with optional group-by aggregation and intersection operators (a much larger class than prior QBE techniques support) from user-provided examples, in an open-world setting. (2) We express the problem of query intent discovery using a probabilistic abduction model that infers a query as the most likely explanation of the provided examples. (3) We introduce the notion of an abduction-ready database, which precomputes semantic properties and related statistics, allowing SQuID to achieve real-time performance. (4) We present an extensive empirical evaluation on three real-world datasets, including user intent case studies, demonstrating that SQuID is efficient and effective, and outperforms machine learning methods, as well as the state of the art in the related query reverse engineering problem. (5) We contrast SQuID with traditional SQL querying through a comparative user study, which demonstrates that users with varying expertise are significantly more effective and efficient with SQuID than with SQL. We find that SQuID eliminates the barriers in studying the database schema, formalizing task semantics, and writing syntactically correct SQL queries, and thus substantially alleviates the need for technical expertise in data exploration.
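To make the QBE setting concrete, the toy below uses a crude minimality proxy for “most likely explanation”: among candidate queries whose results cover all user examples, prefer the most specific one. This is our illustration of the problem setting, not SQuID’s probabilistic abduction model.

```python
# Toy query-by-example scoring (ours, not SQuID's abduction model):
# among candidates covering all examples, pick the tightest result set.

people = [
    ("curie", "physics"), ("einstein", "physics"),
    ("turing", "cs"), ("knuth", "cs"),
]
examples = {("curie", "physics"), ("einstein", "physics")}

candidates = {
    "all people": lambda: set(people),
    "field = physics": lambda: {p for p in people if p[1] == "physics"},
    "field = cs": lambda: {p for p in people if p[1] == "cs"},
}

# Keep candidates whose results contain every example tuple, then
# prefer the one returning the fewest extra tuples.
valid = {name: q() for name, q in candidates.items() if examples <= q()}
best = min(valid, key=lambda name: len(valid[name]))
print(best)  # "field = physics" -- tightest query covering the examples
```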
{"title":"Example-driven semantic-similarity-aware query intent discovery: Empowering users to cross the SQL barrier through query by example","authors":"Anna Fariha , Lucy Cousins , Narges Mahyar , Alexandra Meliou","doi":"10.1016/j.is.2026.102687","DOIUrl":"10.1016/j.is.2026.102687","url":null,"abstract":"<div><div>Traditional relational data interfaces require precise structured queries over potentially complex schemas. These rigid data retrieval mechanisms pose hurdles for nonexpert users, who typically lack programming language expertise and are unfamiliar with the details of the schema. Existing tools assist in formulating queries through keyword search, query recommendation, and query auto-completion, but still require some technical expertise. An alternative method for accessing data is <em>query by example</em> (QBE), where users express their data exploration intent simply by providing examples of their intended data and the system infers the intended query. However, existing QBE approaches focus on the structural similarity of the examples and ignore the richer context present in the data. As a result, they typically produce queries that are too general, and fail to capture the user’s intent effectively. In this article, we present <span>SQuID</span>, a system that performs <em>semantic-similarity-aware</em> query intent discovery from user-provided example tuples.</div><div>Our work makes the following contributions: (1) We design <span>SQuID</span>: an end-to-end system that automatically formulates select-project-join queries with optional group-by aggregation and intersection operators – a much larger class than what prior QBE techniques support – from user-provided examples, in an open-world setting. (2) We express the problem of query intent discovery using a <em>probabilistic abduction model</em> that infers a query as the most likely explanation of the provided examples. (3) We introduce the notion of an <em>abduction-ready</em> database, which precomputes semantic properties and related statistics, allowing <span>SQuID</span> to achieve real-time performance. (4) We present an extensive empirical evaluation on three real-world datasets, including user intent case studies, demonstrating that <span>SQuID</span> is efficient and effective, and outperforms machine learning methods, as well as the state of the art in the related query reverse engineering problem. (5) We contrast <span>SQuID</span> with traditional <span>SQL</span> querying through a comparative user study, which demonstrates that users with varying expertise are significantly more effective and efficient with <span>SQuID</span> than <span>SQL</span>. We find that <span>SQuID</span> eliminates the barriers in studying the database schema, formalizing task semantics, and writing syntactically correct <span>SQL</span> queries, and, thus, substantially alleviates the need for technical expertise in data exploration.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"138 ","pages":"Article 102687"},"PeriodicalIF":3.4,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146023094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HLR-SQL: Human-like reasoning for Text-to-SQL with the human in the loop
Timo Eckmann, Matthias Urban, Jan-Micha Bodensohn, Carsten Binnig
Pub Date: 2026-01-02 | DOI: 10.1016/j.is.2025.102670 | Information Systems, Vol. 138, Article 102670
Recent LLM-based approaches have achieved impressive results on Text-to-SQL benchmarks such as Spider and Bird. However, these benchmarks do not accurately reflect the complexity typically encountered in real-world enterprise scenarios, where queries often span multiple tables. In this paper, we introduce HLR-SQL, a new approach designed to handle such complex enterprise SQL queries. Unlike existing methods, HLR-SQL imitates Human-Like Reasoning with LLMs by incrementally composing queries through a sequence of intermediate steps, gradually building up to the full query. This is an extended version of Eckmann et al. (2025); the new contributions are centered on incorporating human feedback directly into the reasoning process of HLR-SQL. We evaluate HLR-SQL on a newly constructed benchmark, Spider-HJ, which systematically increases query complexity by splitting tables in the original Spider dataset to raise the average number of joins that queries require. Our experiments show that state-of-the-art models experience up to a 70% drop in execution accuracy on Spider-HJ, while HLR-SQL achieves a 9.51% improvement over the best existing approaches on the Spider leaderboard. Finally, we extend HLR-SQL to incorporate human feedback directly into the reasoning process by allowing the LLM to selectively ask for human help when faced with ambiguity or execution errors. We demonstrate that including the human in the loop in this way yields significantly higher accuracy, particularly for complex queries.
{"title":"HLR-SQL: Human-like reasoning for Text-to-SQL with the human in the loop","authors":"Timo Eckmann , Matthias Urban , Jan-Micha Bodensohn , Carsten Binnig","doi":"10.1016/j.is.2025.102670","DOIUrl":"10.1016/j.is.2025.102670","url":null,"abstract":"<div><div>Recent LLM-based approaches have achieved impressive results on Text-to-SQL benchmarks such as Spider and Bird. However, these benchmarks do not accurately reflect the complexity typically encountered in real-world enterprise scenarios, where queries often span multiple tables. In this paper, we introduce HLR-SQL, a new approach designed to handle such complex enterprise SQL queries. Unlike existing methods, HLR-SQL imitates <u>H</u>uman-<u>L</u>ike <u>R</u>easoning with LLMs by incrementally composing queries through a sequence of intermediate steps, gradually building up to the full query. This is an extended version of Eckmann et al. (2025). The new contributions are centered around incorporating human feedback directly into the reasoning process of HLR-SQL. We evaluate HLR-SQL on a newly constructed benchmark, Spider-HJ, which systematically increases query complexity by splitting tables in the original Spider dataset to raise the average join count needed by queries. Our experiments show that state-of-the-art models experience up to a 70% drop in execution accuracy on Spider-HJ, while HLR-SQL achieves a 9.51% improvement over the best existing approaches on the Spider leaderboard. Finally, we extended HLR-SQL to incorporate human feedback directly into the reasoning process by allowing the LLM to selectively ask for human help when faced with ambiguity or execution errors. We demonstrate that including the human in the loop in this way yields significantly higher accuracy, particularly for complex queries.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"138 ","pages":"Article 102670"},"PeriodicalIF":3.4,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146023097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A visualization-driven decision support system for selecting feature attribution methods
Priscylla Silva, Evandro Ortigossa, Dishita Turakhia, Claudio Silva, Luis Gustavo Nonato
Pub Date: 2026-01-02 | DOI: 10.1016/j.is.2025.102661 | Information Systems, Vol. 138, Article 102661
Feature attribution techniques are crucial for interpreting machine learning models, but practitioners often find it difficult to understand how different methods compare and which one best fits their analytical goals. This difficulty arises from inconsistent results across methods, evaluation metrics that emphasize distinct and sometimes conflicting properties, and subjective preferences that influence how explanation quality is interpreted. In this paper, we introduce Explainalytics, an open-source Python library that transforms this challenging decision-making process into an evidence-based visual analytics workflow. Explainalytics calculates a range of evaluation metrics and presents the results through five coordinated views spanning global to local analysis. Linked filtering, dynamic updates, and brushing allow users to pivot fluidly between global trends and local details, supporting exploratory sense-making rather than rigid pipelines. In a within-subject laboratory study with 10 machine learning practitioners, we compared Explainalytics against a baseline; Explainalytics users experienced significantly lower cognitive workload and reported higher perceived usability.
{"title":"A visualization-driven decision support system for selecting feature attribution methods","authors":"Priscylla Silva , Evandro Ortigossa , Dishita Turakhia , Claudio Silva , Luis Gustavo Nonato","doi":"10.1016/j.is.2025.102661","DOIUrl":"10.1016/j.is.2025.102661","url":null,"abstract":"<div><div>Feature attribution techniques are crucial for interpreting machine learning models, but practitioners often face difficulties to understand how different methods compare and which one best fits their analytical goals. This difficulty arises from inconsistent results across methods, evaluation metrics that emphasize distinct and sometimes conflicting properties, and subjective preferences that influence how explanation quality is interpreted. In this paper, we introduce Explainalytics, an open-source Python library that transforms this challenging decision-making process into an evidence-based visual analytics workflow. Explainalytics calculates a range of evaluation metrics and presents the results through five coordinated views spanning global to local analysis. Linked filtering, dynamic updates, and brushing allow users to pivot fluidly between global trends and local details, supporting exploratory sense-making rather than rigid pipelines. In a within-subject laboratory study with 10 machine learning practitioners, we compared Explainalytics against a baseline. Explainalytics users experienced significantly lower cognitive workload and higher perceived usability.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"138 ","pages":"Article 102661"},"PeriodicalIF":3.4,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145977429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}