Pub Date: 2026-01-23 | DOI: 10.1016/j.jss.2026.112794
Wen Zhang, Jinfu Chen, Saihua Cai, Kun Wang, Yisong Liu, Haotong Ding
Coverage-guided Greybox Fuzzing (CGF) aims to maximize code area exploration within a limited time budget, thereby achieving higher code coverage. Current methods generally estimate seed potential through attributes such as execution speed and size, but often ignore the distribution of the explored program space and the potential of seed categories to reveal new coverage, resulting in unbalanced code area exploration and limited detection of complex code. This paper proposes TMS-Fuzz, a new fuzzing seed scheduling method that balances code area exploration by distinguishing the execution coverage features of seed inputs. By computing the path similarity between the execution coverage of different seed inputs, TMS-Fuzz dynamically and adaptively clusters them. Additionally, to improve the return on investment (ROI) of fuzzing, TMS-Fuzz uses a customized Thompson sampling algorithm to statistically select the seed group with the highest ROI, i.e., the group whose seed mutations are most likely to discover new unique paths and crashes. Finally, TMS-Fuzz fuzzes the target program by mutating the seed files in the selected group. Evaluations on eight real-world programs, compared with state-of-the-art open-source fuzzers, show that TMS-Fuzz improves edge coverage and crash detection capabilities in real programs.
{"title":"A novel seed scheduling scheme using Thompson sampling for coverage-guided greybox fuzzing","authors":"Wen Zhang , Jinfu Chen , Saihua Cai , Kun Wang , Yisong Liu , Haotong Ding","doi":"10.1016/j.jss.2026.112794","DOIUrl":"10.1016/j.jss.2026.112794","url":null,"abstract":"<div><div>Coverage-guided Greybox Fuzzing (CGF) aims to maximize code area exploration within limited time, achieving higher code coverage. Current methods generally estimate seed potential through attributes like execution speed and size, but often ignore the distribution of explored program space and seed category potential in detecting new coverage, resulting in unbalanced code area exploration and limited detection of complex code. This paper proposes TMS-Fuzz, a new fuzzing seed scheduling method that balances code area exploration by distinguishing execution coverage features of seed inputs. By computing the path similarity between the execution coverage of different seed inputs, TMS-Fuzz dynamically and adaptively clusters them. Additionally, to improve the return on investment (ROI) of fuzzing, TMS-Fuzz uses a customized Thompson sampling algorithm to statistically select a seed group with the highest ROI, meaning the mutations of seeds in this group are most likely to discover new unique paths and crashes. Finally, TMS-Fuzz performs fuzzing on the target program by mutating the seed files in the selected seed group. Evaluations on eight real-world programs, compared with state-of-the-art open-source fuzzers, show that TMS-Fuzz improves edge coverage and crash detection capabilities in real programs.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112794"},"PeriodicalIF":4.1,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-23 | DOI: 10.1016/j.jss.2026.112797
Jianwei Wu, James Clause
Lightweight static code analysis tools (linters) are commonly used to inspect complex code, locate format violations, detect software vulnerabilities, and fix bugs. However, developers often lack a good understanding of the capabilities of linters for newer languages like Golang. In this paper, we evaluated existing Go linters by surveying professional developers about real-world issues in the industrial workflow at MathWorks. Because Go linters are still in an early stage of adoption, we continued to observe issues that disrupted our development workflow. This paper presents our practical experience with Go linters, highlighting specific issues that often escaped detection and the consequences of these gaps. The results of the evaluation show that the linters are often unable to detect issues and, even when they are able to, they are insufficient to guide developers to valid solutions. These results provide a better understanding of the capabilities of Go linters and facilitate the development of better tools in the future.
{"title":"An empirical assessment of go linters on real-world issues","authors":"Jianwei Wu , James Clause","doi":"10.1016/j.jss.2026.112797","DOIUrl":"10.1016/j.jss.2026.112797","url":null,"abstract":"<div><div>Lightweight static code analysis tools (linters) are commonly used to inspect complex code, locate format violations, detect software vulnerabilities, and fix bugs. However, developers often lack a good understanding of the capabilities of linters for newer languages like Golang. In this paper, we evaluated existing Go linters by surveying professional developers about real-world issues in the industrial workflow at MathWorks. Because of the early adoption of Go linters, we continued to observe issues that disrupted our development workflow. This paper presents our practical experience with Go linters, highlighting specific issues that often escaped detection and the consequences of these gaps. The results of the evaluation show that the linters are often unable to detect issues and, even when they are able to, they are insufficient to guide developers to valid solutions. These results provide a better understanding of the capabilities of Go linters and facilitate the development of better tools in the future.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112797"},"PeriodicalIF":4.1,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-23 | DOI: 10.1016/j.jss.2026.112787
Carlos J. Fernandez-Candel, Anthony Cleve, Jesus J. Garcia-Molina
Most NoSQL systems adopt a schema-on-read approach to promote flexibility and agility: the structure of the stored data is not constrained by predefined schemas. However, the absence of explicit schema declarations does not imply the absence of schemas themselves. In practice, schemas are implicit in both the application code and the stored data, and are essential for building tools such as data modelers, query optimizers, and data migrators, or for performing database refactorings. As a result, NoSQL schema inference (also known as schema extraction or discovery) has gained attention from the database community, with most approaches focusing on extracting schemas from data. In contrast, source code analysis remains less explored for this purpose.
In this paper, we present a static code analysis strategy to extract logical schemas from NoSQL applications. Our solution is based on a model-driven reverse engineering process composed of a chain of platform-independent model transformations. The extracted schema conforms to the U-Schema unified metamodel, which can represent both NoSQL and relational schemas. To support this process, we define a metamodel capable of representing the core elements of object-oriented languages. Application code is first injected into a code model, from which a control flow model is derived. This, in turn, enables the generation of a model representing both data access operations and the structure of stored data. From these models, the U-Schema logical schema is inferred. Additionally, the extracted information can be used to identify refactoring opportunities. We illustrate this capability through the detection of join-like query patterns and the automated application of field duplication strategies to eliminate expensive joins. All stages of the process are described in detail, and the approach is validated through a round-trip experiment in which an application using a MongoDB store is automatically generated from a predefined schema. The inferred schema is then compared to the original to assess the accuracy of the extraction process.
{"title":"Towards the automated extraction and refactoring of NoSQL schemas from application code","authors":"Carlos J. Fernandez-Candel , Anthony Cleve , Jesus J. Garcia-Molina","doi":"10.1016/j.jss.2026.112787","DOIUrl":"10.1016/j.jss.2026.112787","url":null,"abstract":"<div><div>Most NoSQL systems adopt a schema-on-read approach to promote flexibility and agility: the structure of the stored data is not constrained by predefined schemas. However, the absence of explicit schema declarations does not imply the absence of schemas themselves. In practice, schemas are implicit in both the application code and the stored data, and are essential for building tools such as data modelers, query optimizers, data migrators, or for performing database refactorings. As a result, NoSQL schema inference (also known as schema extraction or discovery) has gained attention from the database community, with most approaches focusing on extracting schemas from data. In contrast, the source code analysis remains less explored for this purpose.</div><div>In this paper, we present a static code analysis strategy to extract logical schemas from NoSQL applications. Our solution is based on a model-driven reverse engineering process composed of a chain of platform-independent model transformations. The extracted schema conforms to the U-Schema unified metamodel, which can represent both NoSQL and relational schemas. To support this process, we define a metamodel capable of representing the core elements of object-oriented languages. Application code is first injected into a code model, from which a control flow model is derived. This, in turn, enables the generation of a model representing both data access operations and the structure of stored data. From these models, the U-Schema logical schema is inferred. Additionally, the extracted information can be used to identify refactoring opportunities. We illustrate this capability through the detection of join-like query patterns and the automated application of field duplication strategies to eliminate expensive joins. All stages of the process are described in detail, and the approach is validated through a round-trip experiment in which a application using a MongoDB store is automatically generated from a predefined schema. The inferred schema is then compared to the original to assess the accuracy of the extraction process.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112787"},"PeriodicalIF":4.1,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-22 | DOI: 10.1016/j.jss.2026.112795
Shuning Ge, Fangyun Qin, Xiaohui Wan, Yang Liu, Qian Dai, Zheng Zheng
Software systems that run for long periods often suffer from software aging, which is typically caused by Aging-Related Bugs (ARBs). To mitigate the risk of ARBs early in the development phase, ARB prediction has been introduced into software aging research. However, due to the difficulty of collecting ARBs, within-project ARB prediction faces the challenge of data scarcity, which has led to the proposal of cross-project ARB prediction. This task faces two major challenges: (1) the domain adaptation issue caused by the distribution difference between source and target projects; and (2) the severe class imbalance between ARB-prone and ARB-free samples. Although various methods have been proposed for cross-project ARB prediction, existing approaches treat the input metrics independently and often neglect the rich inter-metric dependencies, which can lead to overlapping information and misjudgment of metric importance, potentially affecting model performance. Moreover, they typically use cross-entropy as the loss function during training, which cannot distinguish the difficulty of classifying individual samples. To overcome these limitations, we propose ARFT-Transformer, a transformer-based cross-project ARB prediction framework that introduces a metric-level multi-head attention mechanism to capture metric interactions and incorporates the Focal Loss function to effectively handle class imbalance. Experiments conducted on three large-scale open-source projects demonstrate that ARFT-Transformer on average outperforms state-of-the-art cross-project ARB prediction methods in both single-source and multi-source cases, achieving improvements of up to 29.54% and 19.92%, respectively, in the Balance metric.
{"title":"ARFT-Transformer: Modeling metric dependencies for cross-project aging-related bug prediction","authors":"Shuning Ge , Fangyun Qin , Xiaohui Wan , Yang Liu , Qian Dai , Zheng Zheng","doi":"10.1016/j.jss.2026.112795","DOIUrl":"10.1016/j.jss.2026.112795","url":null,"abstract":"<div><div>Software systems that run for long periods often suffer from software aging, which is typically caused by Aging-Related Bugs (ARBs). To mitigate the risk of ARBs early in the development phase, ARB prediction has been introduced into software aging research. However, due to the difficulty of collecting ARBs, within-project ARB prediction faces the challenge of data scarcity, leading to the proposal of cross-project ARB prediction. This task faces two major challenges: 1) domain adaptation issue caused by distribution difference between source and target projects; and 2) severe class imbalance between ARB-prone and ARB-free samples. Although various methods have been proposed for cross-project ARB prediction, existing approaches treat the input metrics independently and often neglect the rich inter-metric dependencies, which can lead to overlapping information and misjudgment of metric importance, potentially affecting the model’s performance. Moreover, they typically use cross-entropy as the loss function during training, which cannot distinguish the difficulty of sample classification. To overcome these limitations, we propose ARFT-Transformer, a transformer-based cross-project ARB prediction framework that introduces a metric-level multi-head attention mechanism to capture metric interactions and incorporates Focal Loss function to effectively handle class imbalance. Experiments conducted on three large-scale open-source projects demonstrate that ARFT-Transformer on average outperforms state-of-the-art cross-project ARB prediction methods in both single-source and multi-source cases, achieving up to a 29.54% and 19.92% improvement in Balance metric.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112795"},"PeriodicalIF":4.1,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-21 | DOI: 10.1016/j.jss.2026.112792
Muhammad Umar Zeshan, Motunrayo Ibiyo, Claudio Di Sipio, Phuong T. Nguyen, Davide Di Ruscio
Malicious code in open-source repositories such as PyPI poses a growing threat to software supply chains. Traditional rule-based tools often overlook the semantic patterns in source code that are crucial for identifying adversarial components. Large language models (LLMs) show promise for software analysis, yet their use in interpretable and modular security pipelines remains limited.
This paper presents LAMPS, a multi-agent system that employs collaborative LLMs to detect malicious PyPI packages. The system consists of four role-specific agents for package retrieval, file extraction, classification, and verdict aggregation, coordinated through the CrewAI framework. A prototype combines a fine-tuned CodeBERT model for classification with LLaMA 3 agents for contextual reasoning. LAMPS has been evaluated on two complementary datasets: D1, a balanced collection of 6000 setup.py files, and D2, a realistic multi-file dataset with 1296 files and natural class imbalance. On D1, LAMPS achieves 97.7% accuracy, surpassing MPHunter and TF-IDF stacking models, two state-of-the-art approaches. On D2, it reaches 99.5% accuracy and 99.5% balanced accuracy, outperforming RAG-based approaches and fine-tuned single-agent baselines. McNemar’s test confirmed that these improvements are highly significant. The results demonstrate the feasibility of distributed LLM reasoning for malicious code detection and highlight the benefits of modular multi-agent designs in software supply chain security.
{"title":"Many hands make light work: An LLM-based multi-agent system for detecting malicious PyPI packages","authors":"Muhammad Umar Zeshan, Motunrayo Ibiyo, Claudio Di Sipio, Phuong T. Nguyen, Davide Di Ruscio","doi":"10.1016/j.jss.2026.112792","DOIUrl":"10.1016/j.jss.2026.112792","url":null,"abstract":"<div><div>Malicious code in open-source repositories such as PyPI poses a growing threat to software supply chains. Traditional rule-based tools often overlook the semantic patterns in source code that are crucial for identifying adversarial components. Large language models (LLMs) show promise for software analysis, yet their use in interpretable and modular security pipelines remains limited.</div><div>This paper presents <span>LAMPS</span>, a multi-agent system that employs collaborative LLMs to detect malicious PyPI packages. The system consists of four role-specific agents for <em>package retrieval, file extraction, classification</em>, and <em>verdict aggregation</em>, coordinated through the CrewAI framework. A prototype combines a fine-tuned CodeBERT model for classification with LLaMA 3 agents for contextual reasoning. <span>LAMPS</span> has been evaluated on two complementary datasets: D<sub>1</sub>, a balanced collection of 6000 <span>setup.py</span> files, and D<sub>2</sub>, a realistic multi-file dataset with 1296 files and natural class imbalance. On D<sub>1</sub>, <span>LAMPS</span> achieves 97.7% accuracy, surpassing <span>MPHunter</span> and TD-IDF stacking models–two state-of-the-art approaches. On D<sub>2</sub>, it reaches 99.5% accuracy and 99.5% balanced accuracy, outperforming RAG-based approaches and fine-tuned single-agent baselines. McNemar’s test confirmed these improvements as highly significant. The results demonstrate the feasibility of distributed LLM reasoning for malicious code detection and highlight the benefits of modular multi-agent designs in software supply chain security.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112792"},"PeriodicalIF":4.1,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146039476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-21 | DOI: 10.1016/j.jss.2026.112785
Tarannum Shaila Zaman, Chadni Islam, Jiangfan Shi, Zihan Shi, Fiona Xian, Tingting Yu
Reproducing system-level concurrency bugs requires both input data and the precise interleaving order of system calls. This process is challenging because such bugs are non-deterministic, and bug reports often lack the detailed information needed. Additionally, the unstructured nature of reports written in natural language makes it difficult to extract necessary details. Existing tools are inadequate to reproduce these bugs due to their inability to manage the specific interleaving at the system call level. To address these challenges, we propose SysPro, a novel approach that automatically extracts relevant system call names from bug reports and identifies their locations in the source code. It generates input data by utilizing information retrieval, regular expression matching, and the category-partition method. This extracted input and interleaving data are then used to reproduce bugs through dynamic source code instrumentation. Our empirical study on real-world benchmarks demonstrates that SysPro is both effective and efficient at localizing and reproducing system-level concurrency bugs from bug reports.
{"title":"SysPro: Reproducing system-level concurrency bugs from bug reports","authors":"Tarannum Shaila Zaman , Chadni Islam , Jiangfan Shi , Zihan Shi , Fiona Xian , Tingting Yu","doi":"10.1016/j.jss.2026.112785","DOIUrl":"10.1016/j.jss.2026.112785","url":null,"abstract":"<div><div>Reproducing system-level concurrency bugs requires both input data and the precise interleaving order of system calls. This process is challenging because such bugs are non-deterministic, and bug reports often lack the detailed information needed. Additionally, the unstructured nature of reports written in natural language makes it difficult to extract necessary details. Existing tools are inadequate to reproduce these bugs due to their inability to manage the specific interleaving at the system call level. To address these challenges, we propose SysPro, a novel approach that automatically extracts relevant system call names from bug reports and identifies their locations in the source code. It generates input data by utilizing information retrieval, regular expression matching, and the category-partition method. This extracted input and interleaving data are then used to reproduce bugs through dynamic source code instrumentation. Our empirical study on real-world benchmarks demonstrates that SysPro is both effective and efficient at localizing and reproducing system-level concurrency bugs from bug reports.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112785"},"PeriodicalIF":4.1,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The ability to adapt effectively to new programming languages is a desirable skill for students, particularly for careers that demand the rapid adoption of different languages. This study aims to measure the cognitive load required to generalize programming skills from one language to another across three different tasks: code comprehension, syntactic debugging, and semantic debugging. Participants with a basic background in either Java or Python (but not both) were asked to explain a given code segment or to identify the syntactic or semantic bug in code written in Python. The cognitive load (i.e., mental effort) that Java-trained students expend on the three Python tasks is then measured using eye-tracking technology and compared against that of Python-trained students to determine the overhead of processing these tasks. Our results show that the difference in cognitive load between Java and Python students was more significant when focusing on conditional or iterative constructs compared to other statements in the code. These findings suggest that certain code elements require more effort than others when trying to understand code in a new language, guiding educators toward focusing more on those challenging areas when instructing students with existing knowledge in a different programming language.
{"title":"An empirical eye-tracking study of cross-lingual program comprehension and debugging","authors":"Ameer Mohammed, Reem Albaghli, Hanaa Alrushood, Fatme Ghaddar","doi":"10.1016/j.jss.2026.112793","DOIUrl":"10.1016/j.jss.2026.112793","url":null,"abstract":"<div><div>The ability for students to effectively adapt to new programming languages is a desirable skill that is useful for careers demanding rapid adoption of different languages. This study aims to measure the cognitive load required to generalize programming skills from one language to another across three different tasks: code comprehension, syntactic debugging, and semantic debugging. Participants with basic background in either Java or Python (but not both) were asked to explain a given code segment or identify the syntactic/semantic bug for code written in Python. The cognitive load (i.e., mental effort) to tackle the three tasks in Python for Java-trained students is then measured by employing eye-tracking technology and compared against Python-trained students to determine the overhead in processing these tasks. Our results show that the difference in cognitive load between Java and Python students was more significant when focusing on conditional or iterative constructs compared to other statements in the code. These findings suggest that certain code elements require more effort than others when trying to understand code in a new language, guiding educators toward focusing more on those challenging areas when instructing students with existing knowledge in a different programming language.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112793"},"PeriodicalIF":4.1,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146039477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-20 | DOI: 10.1016/j.jss.2026.112791
Ömer Özdemir, Reyhan Aydoğan, Hasan Sözer
Continuous Integration (CI) is a development practice where developers regularly merge their code changes into a central repository, enabling simultaneous collaboration across a shared codebase. This frequent integration and automated building process in CI helps to detect and resolve conflicts or errors early in development. However, in large-scale systems, the build process can be costly. Each build incurs expenses, while skipping builds can increase the risk of undetected failures. Accurate predictions can help to identify builds that can be safely skipped to reduce CI costs. This paper presents an empirical study within an industrial setting, investigating the use of machine learning techniques to predict build failures after a set of collective changes. Unlike many existing works that apply random data splitting, our results show that chronological (time-based) splitting offers a more realistic and reliable assessment of model performance in CI environments. We evaluate various models and feature combinations on a dataset derived from real-world industrial projects. We observe high precision but low recall in predicting failed builds, allowing hundreds of successful builds to be correctly skipped, with around a dozen failures potentially being missed. Our analysis shows that this yields substantial time savings of approximately 2.5 h per build on average, while missed failures necessarily result in delayed failure detection, whose practical impact depends on application criticality and operational context.
{"title":"On the use of machine learning for failure prediction after collective changes in automated continuous integration testing","authors":"Ömer Özdemir , Reyhan Aydoğan , Hasan Sözer","doi":"10.1016/j.jss.2026.112791","DOIUrl":"10.1016/j.jss.2026.112791","url":null,"abstract":"<div><div>Continuous Integration (CI) is a development practice where developers regularly merge their code changes into a central repository, enabling simultaneous collaboration across a shared codebase. This frequent integration and automated building process in CI helps to detect and resolve conflicts or errors early in development. However, in large-scale systems, the build process can be costly. Each build incurs expenses, while skipping builds can increase the risk of undetected failures. Accurate predictions can help to identify builds that can be safely skipped to reduce CI costs. This paper presents an empirical study within an industrial setting, investigating the use of machine learning techniques to predict build failures after a set of collective changes. Unlike many existing works that apply random data splitting, our results show that chronological (time-based) splitting offers a more realistic and reliable assessment of model performance in CI environments. We evaluate various models and feature combinations on a dataset derived from real-world industrial projects. We observe high precision but low recall in predicting failed builds, allowing hundreds of successful builds to be correctly skipped, with around a dozen failures potentially being missed. Our analysis shows that this yields substantial time savings of approximately 2.5 h per build on average, while missed failures necessarily result in delayed failure detection, whose practical impact depends on application criticality and operational context.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112791"},"PeriodicalIF":4.1,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-18 | DOI: 10.1016/j.jss.2026.112790
Godfried B Adaba
Hybrid project management is becoming a dominant delivery mode in software engineering, yet the mechanisms through which organisations enact and sustain hybrid practices remain insufficiently theorised. Existing accounts often imply a linear or prescriptive integration of governance and agile methods, overlooking the negotiated, context-dependent nature of hybrid work. This study advances a process-based explanation of hybrid delivery by developing the grounded theory of contingent hybridity, derived through a Constructivist Grounded Theory (CGT) study within a multinational IT firm. Drawing on interviews, observations, and project artefacts, the findings show that hybridisation is not simply the coexistence of plan-driven project governance and agile routines, but an emergent socio-technical process shaped by practitioners’ interpretive work and situated adaptation. Four interdependent mechanisms structure this process: structural anchoring, through which governance frameworks provide stability and legitimacy; adaptive enactment, whereby agile practices are tailored and embedded within formal controls; boundary work, involving translators and hybrid ceremonies that reconcile divergent organisational logics; and role hybridisation, in which practitioners fluidly shift between control-oriented and delivery-focused responsibilities. The analysis demonstrates that hybrid practices vary across roles and project phases, with effective integration depending less on adherence to prescribed templates and more on ongoing, context-sensitive negotiation. These insights refine theoretical understandings of hybrid project management by moving beyond static typologies toward a dynamic, practice-centred perspective and offer actionable guidance for organisations seeking to balance agility and control in complex, regulated environments.
{"title":"The pragmatics of hybridity: A grounded theory of method integration in software engineering projects","authors":"Godfried B Adaba","doi":"10.1016/j.jss.2026.112790","DOIUrl":"10.1016/j.jss.2026.112790","url":null,"abstract":"<div><div>Hybrid project management is becoming a dominant delivery mode in software engineering, yet the mechanisms through which organisations enact and sustain hybrid practices remain insufficiently theorised. Existing accounts often imply a linear or prescriptive integration of governance and agile methods, overlooking the negotiated, context-dependent nature of hybrid work. This study advances a process-based explanation of hybrid delivery by developing the grounded theory of contingent hybridity, derived through a Constructivist Grounded Theory (CGT) study within a multinational IT firm. Drawing on interviews, observations, and project artefacts, the findings show that hybridisation is not simply the coexistence of plan-driven project governance and agile routines, but an emergent socio-technical process shaped by practitioners’ interpretive work and situated adaptation. Four interdependent mechanisms structure this process: structural anchoring, through which governance frameworks provide stability and legitimacy; adaptive enactment, whereby agile practices are tailored and embedded within formal controls; boundary work, involving translators and hybrid ceremonies that reconcile divergent organisational logics; and role hybridisation, in which practitioners fluidly shift between control-oriented and delivery-focused responsibilities. The analysis demonstrates that hybrid practices vary across roles and project phases, with effective integration depending less on adherence to prescribed templates and more on ongoing, context-sensitive negotiation. These insights refine theoretical understandings of hybrid project management by moving beyond static typologies toward a dynamic, practice-centred perspective and offer actionable guidance for organisations seeking to balance agility and control in complex, regulated environments.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"235 ","pages":"Article 112790"},"PeriodicalIF":4.1,"publicationDate":"2026-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146038340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In software reliability engineering, practitioners need to estimate software reliability measures accurately from software fault-count data to support release decisions and project management. To achieve these objectives, software fault-count processes are often described using software reliability models (SRMs) based on stochastic counting processes such as non-homogeneous Poisson processes (NHPPs), and statistical point estimation of the model parameters is carried out. Substituting the point estimates of the model parameters into the reliability measures of interest yields their point estimates. However, since such point estimators tend to have high variances, the resulting release decisions and project management plans are not reliable under uncertainty. Interval estimation of software reliability measures is therefore expected to enable more robust decision making, but analytical confidence regions are quite difficult to obtain. The bootstrap is a statistical method that generates realizations of statistical estimators by resampling the fault-count data. It allows us to evaluate the statistical properties of software reliability measures under uncertainty. In this paper, we propose a fine-grained parametric bootstrap method for NHPP-based SRMs, in which a thinning-like resampling algorithm is employed instead of the intuitive resampling algorithms that generate bootstrap data suffering from the ties problem. We compare our thinning-like resampling algorithm with the existing ones in both a Monte Carlo simulation and an empirical study. The results show that the model parameters and the associated software reliability measures estimated by our fine-grained parametric bootstrap method are more accurate and robust than those obtained with the other bootstrap algorithms.
{"title":"A Fine-grained parametric bootstrap approach for NHPP-based software reliability modeling","authors":"Jingchi Wu, Tadashi Dohi, Junjun Zheng, Hiroyuki Okamura","doi":"10.1016/j.jss.2026.112789","DOIUrl":"10.1016/j.jss.2026.112789","url":null,"abstract":"<div><div>In software reliability, practitioners demand to estimate software reliability measures accurately from software fault-count data, for making the release decision and project management. To achieve these objectives, software fault-count processes are often described using software reliability models (SRMs) based on stochastic counting processes like non-homogeneous Poisson processes (NHPPs), and statistical point estimation of model parameters is carried out. Substituting the point estimates of model parameters into several software reliability measures, one gets the point estimates of desired reliability measures. However, since such point estimators tend to have high variances, the resulting release decision and project management plans are not reliable under uncertainty. Then, interval estimation of software reliability measures is expected to realize more robust decision making, but is quite difficult to obtain the analytical confidence regions. Bootstrap is a statistical method that generates realizations of statistical estimators by resampling fault-count data. It allows us to evaluate the statistical properties of software reliability measures under uncertainty. In this paper, we propose a fine-grained parametric bootstrap method for NHPP-based SRMs, where a thinning-like resampling algorithm is employed instead of intuitive resampling algorithms which generate the bootstrap data with ties problem. We compare our thinning-like resampling algorithm with the existing ones in both Monte Carlo simulation and empirical study. It can be shown that the model parameters and their associated software reliability measures estimated by our fine-grained parametric bootstrap method are more accurate and robust than the other bootstrap algorithms.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"236 ","pages":"Article 112789"},"PeriodicalIF":4.1,"publicationDate":"2026-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146080860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}