Journal of Systems and Software: Latest Articles, Page 4

Exploring challenges in test mocking: Developer questions and insights from StackOverflow

IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software

Pub Date : 2025-12-31 DOI: 10.1016/j.jss.2025.112748

Mumtahina Ahmed , Md Nahidul Islam Opu , Chanchal Roy , Sujana Islam Suhi , Shaiful Chowdhury

Mocking is a common unit testing technique that is used to simplify tests, reduce flakiness, and improve coverage by replacing real dependencies with simplified implementations. Despite its widespread use in Open Source Software (OSS) projects, there is limited understanding of how and why developers use mocks and the challenges they face. In this study, we analyzed 25,302 questions related to mocking on StackOverflow to identify the challenges faced by developers. We used Latent Dirichlet Allocation (LDA) for topic modeling, identified 30 key topics, and grouped the topics into five key categories. We then analyzed the annual and relative probabilities of each category to understand the evolution of mocking-related discussions. Trend analysis reveals that categories such as Mocking Techniques and External Services have remained consistently dominant, highlighting evolving developer priorities and ongoing technical challenges. While questions in the Theoretical category declined after 2010, posts regarding Error Handling grew notably from 2009.

Our findings also show an inverse relationship between a topic’s popularity and its difficulty. Popular topics like Framework Selection tend to have lower difficulty and faster resolution times, while complex topics like HTTP Requests and Responses are more likely to remain unanswered and take longer to resolve. Additionally, we evaluated questions based on their answer status (successful, ordinary, or unsuccessful) and found that topics such as Framework Selection have higher success rates, whereas tool setup and Android-related issues are more often unresolved. A classification of questions into How, Why, What, and Other revealed that over 64% are How questions, particularly in practical domains like file access, APIs, and databases, indicating a strong need for implementation guidance. Why questions are more prevalent in error-handling contexts, reflecting conceptual challenges in debugging, while What questions are rare and mostly tied to theoretical discussions. These insights offer valuable guidance for improving developer support, tooling, and educational content in the context of mocking and unit testing.
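To make the idea of "replacing real dependencies with simplified implementations" concrete, here is a minimal sketch in Python using the standard library's `unittest.mock`. It is an illustration only and is not taken from the paper: the function `fetch_username`, the helper `make_client`, and the `/users/{id}` path are hypothetical names chosen for the example. It also touches the HTTP request/response scenario that the abstract lists among the harder topics.

```python
from unittest.mock import Mock


def fetch_username(client, user_id):
    """Return the user's name, or 'unknown' if the service reports an error.

    `client` is any object with a get(path) method returning a response that
    exposes .status_code and .json() (a hypothetical interface for this sketch).
    """
    response = client.get(f"/users/{user_id}")
    if response.status_code != 200:
        return "unknown"
    return response.json()["name"]


def make_client(status_code, payload):
    """Build a mock HTTP client whose get() returns a canned response."""
    response = Mock()
    response.status_code = status_code
    response.json.return_value = payload
    client = Mock()
    client.get.return_value = response
    return client


# Happy path: the mock stands in for the real HTTP dependency.
ok_client = make_client(200, {"name": "ada"})
assert fetch_username(ok_client, 7) == "ada"
ok_client.get.assert_called_once_with("/users/7")

# Error path: simulated without any real network failure.
failing_client = make_client(500, {})
assert fetch_username(failing_client, 7) == "unknown"
```

Passing the client in as a parameter keeps the test free of patching; the same test could instead patch a module-level client, at the cost of coupling the test to import paths.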


Volume 235, Article 112748

Citations: 0

LeCov: Multi-level testing criteria for large language models

IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software

Pub Date : 2025-12-30 DOI: 10.1016/j.jss.2025.112763

Xuan Xie , Jiayang Song , Yuheng Huang , Da Song , Felix Juefei-Xu , Lei Ma

Large Language Models (LLMs) are widely used in many different domains, but because of their limited interpretability, there are questions about how trustworthy they are from various perspectives, e.g., truthfulness and toxicity. Recent research has started developing testing methods for LLMs, aiming to uncover untrustworthy issues, i.e., defects, before deployment. However, systematic and formalized testing criteria are lacking, which hinders a comprehensive assessment of the extent and adequacy of testing exploration. To mitigate this threat, we propose a set of multi-level testing criteria, LeCov, for LLMs. The criteria consider three crucial LLM internal components, i.e., the attention mechanism, feed-forward neurons, and uncertainty, and contain nine types of testing criteria in total. We apply the criteria in two scenarios: test prioritization and coverage-guided testing. The experimental evaluation, on five models and four datasets, demonstrates the usefulness and effectiveness of LeCov.


Citations: 0

A reference architecture for ethical-aware autonomous systems

IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software

Pub Date : 2025-12-30 DOI: 10.1016/j.jss.2025.112749

Marco Autili , Martina De Sanctis , Paola Inverardi , Mashal Afzal Memon , Patrizio Pelliccione , Sara Pettinari

Background. Autonomous systems, whether AI-enabled or not, are ubiquitous and pervasive in our daily lives. While their adoption and use bring many benefits, they also pose significant ethical challenges. Objective. The objective of this work is to contribute a reference architecture for ethical-aware autonomous systems, focusing on their interaction and collaboration with humans, whether those humans are proactive, reactive, or passive in the interaction with the systems. Method. To define the architecture, we analyzed scientific papers in the field, guidelines and recommendations, as well as laws and regulations. We then applied this acquired knowledge to build the reference architecture. The results of this work were validated through expert interviews and a scenario-based evaluation. Results. We contribute (i) a definition of ethical-aware autonomous systems, (ii) requirements for ethical-aware autonomous systems, and (iii) a reference architecture for ethical-aware autonomous systems. Our reference architecture is intended to help system and software engineers design autonomous or intelligent systems that should interact and operate with humans in ethically sensitive contexts such as healthcare, social robotics, and assistive technologies. Conclusion. We believe that this work will assist software architects and engineers in designing and developing autonomous systems that should interact and collaborate with humans while respecting values important to individuals, society, and the environment.


Volume 235, Article 112749

Citations: 0

Inter-organizational collaborations in open-source software ecosystems

IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software

Pub Date : 2025-12-30 DOI: 10.1016/j.jss.2025.112765

Roland Robert Schreiber , Thomas Wieland

Context

Open-source software (OSS) ecosystems are pivotal to modern software development by fostering innovation through inter-organizational collaboration. Although this collaboration is becoming increasingly important, research on its patterns and dynamics in OSS remains limited.

Objective

This study investigates inter-organizational collaboration within different OSS ecosystems, focusing on identifying key contributing and influential organizations, analyzing collaboration patterns, and understanding their evolution over time.

Method

An exploratory case study with a mixed-methods approach was conducted, with data collected from five prominent OSS ecosystems on GitHub (React, Vue, TensorFlow, Bootstrap, and Flutter) involving 9947 developers and 339 organizations over ten years. Both qualitative and quantitative analyses were performed, including software repository mining and applying different levels (micro/macro) and dimensions (rational/structural/functional) of social network analysis (SNA) to examine collaboration patterns.

Results

Key organizations, such as Alphabet, TensorFlow, Meta, Nvidia, Intel, MobileIron, IBM, and AMD, emerged as significant contributors, acting as central hubs for collaboration within OSS ecosystems. The collaboration networks showed an initial phase of rapid growth followed by stabilization, indicating project maturation. Notably, competing firms were found to contribute to the same ecosystems, underpinning the inherently cooperative, reciprocal, and multiplex nature of OSS development.

Conclusion

The findings highlight the pivotal role of top-contributing organizations in fostering collaboration and driving innovation within OSS ecosystems. Understanding these dynamics provides valuable insights for developing strategies to enhance the effectiveness, scalability, and long-term sustainability of OSS ecosystems, benefiting both individual contributors and participating organizations.


Volume 235, Article 112765

Citations: 0

Enabling direct manipulation of plain-text output for template programs

IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software

Pub Date : 2025-12-30 DOI: 10.1016/j.jss.2025.112754

Tao Zan , Xiao He , Zhenjiang Hu

In modern software development, templates are widely used to generate artifacts, enhancing productivity and consistency. However, developers often face a tedious run-modify-run loop to verify template correctness, as outputs are only inspectable after execution. This issue is exacerbated when templates involve complex logic, highlighting the need for a solution that supports both artifact generation and user-friendly modifications. We propose BIT2, a framework enabling direct manipulation of plain-text output to modify template programs. A key challenge is that plain-text output lacks structural information, making it difficult to map updates back to the template. BIT2 addresses this by generating computation-structured output, preserving the underlying logic, and supporting update operations (insertion, replacement, deletion) with carefully designed fusion semantics. Our implementation, written in TypeScript (18,000 lines of code), is available as an NPM module, complemented by a VSCode plugin, LiveT, for seamless template editing. Experimental results demonstrate that BIT2 achieves efficient evaluation and supports diverse update scenarios, significantly enhancing expressiveness in template development.


Volume 235, Article 112754

Citations: 0

AI in GUI-based testing: A survey of techniques, tools, and perceived advantages and limitations

IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software

Pub Date : 2025-12-29 DOI: 10.1016/j.jss.2025.112751

Domenico Amalfitano , Riccardo Coppola , Damiano Distante , Filippo Ricca

Background: The adoption of Artificial Intelligence (AI) techniques in Software Testing (ST) has grown rapidly, particularly in response to the increasing complexity of modern systems. In GUI-based testing, AI is often cited as a promising means to automate repetitive tasks and improve testing efficiency. However, the actual use of AI in this domain remains underexplored through systematic empirical investigation.

Objective: This study aims to analyze how AI is adopted in GUI-based testing, identifying the techniques and tools employed, the testing activities they support, and the perceived benefits and limitations.

Method: We conducted a large-scale survey involving 107 participants from both academia and industry. The survey focuses on three core testing activities: test case definition, test oracle design, and test case optimization. It extends a prior study based on interviews with 45 industry practitioners.

Results: Findings show that AI is primarily used to support test case definition, with techniques such as Natural Language Processing, Optimization, and Large Language Models (LLMs) being the most common. AI also provides support in test oracle design, where image processing and knowledge representation play key roles, and in test suite optimization, through the use of supervised learning, reinforcement learning, and search-based techniques.

Conclusion: The paper identifies ongoing challenges and outlines future directions, including the need for transparent AI tools, guidelines for LLM integration, and the deployment of a continuously open survey to monitor trends in AI adoption over time.


Volume 235, Article 112751

Citations: 0

Software refactoring research with large language models: A systematic literature review

IF 4.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software

Pub Date : 2025-12-27 DOI: 10.1016/j.jss.2025.112762

Sofia Martinez , Luo Xu , Mariam Elnaggar , Eman Abdullah Alomar

Background: Code refactoring is the improvement of code internally without changing the external functionalities of the program. Due to its exhaustive nature, developers often avoid manually refactoring code. Researchers have since looked into utilizing Large Language Models (LLMs) to automate the task of refactoring.

Aim and Method: Despite the promising results, there is a lack of clear understanding of LLMs’ effectiveness in automated refactoring. To address this issue, we conducted a Systematic Literature Review (SLR) of 50 primary studies. We categorized the studies by the refactoring methods studied, the prompt engineering techniques applied, the LLM tools used, the languages used, and the datasets used. We also examined the benchmarks each study used, how accurate LLM-generated refactorings are, and the challenges that this field currently faces.

Result: From our literature review we found that: (i) Studies use a variety of tools to enhance and study LLM-driven refactoring, including tools to detect code smells, generate code bases, and compare refactoring outcomes. (ii) Various datasets were collected for analysis from multiple open-source projects in multiple programming languages. The sources included GitHub, Apache, and F-Droid, and the most frequently collected and analyzed language was Java. (iii) One-Shot, Few-Shot, Context-Specific, and Chain-of-Thought prompting methods have been shown to be the most effective, depending on the language used, in some instances reducing code smells by up to 89%. (iv) The definition of “Accuracy” varies significantly across the literature surveyed, as it depends on the context of each study, which calls for a standardized measurement of Accuracy. (v) The most often mentioned code smells were Large Class and Long Method, although many studies did not specify the smells they addressed; the most often applied refactoring type is Extract Method, and using LLMs to perform it shows promising results. (vi) LLMs often generate erroneous code, struggle with more complex refactoring, and often misunderstand developers’ requests or miss refactoring requests.

Conclusion: Our study serves as a collection of knowledge on LLM refactoring gathered from various other studies and highlights issues that researchers often miss. We hope our findings will empower and guide the future development of LLM-driven refactoring.
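For readers unfamiliar with Extract Method, the refactoring type the review reports as most often applied, the following is a minimal hand-written sketch in Python. It is not drawn from any of the surveyed studies; the invoice example and all function names are hypothetical. It shows the defining property stated in the Background: internal structure changes while external behavior stays the same.

```python
def invoice_total_before(items, tax_rate):
    """Original version: subtotal and tax logic are inlined in one function."""
    subtotal = 0.0
    for price, quantity in items:
        subtotal += price * quantity
    tax = subtotal * tax_rate
    return round(subtotal + tax, 2)


def compute_subtotal(items):
    """Extracted method: sum of price * quantity over all line items."""
    return sum(price * quantity for price, quantity in items)


def compute_tax(subtotal, tax_rate):
    """Extracted method: tax owed on a given subtotal."""
    return subtotal * tax_rate


def invoice_total_after(items, tax_rate):
    """Refactored version: same result, composed from the extracted methods."""
    subtotal = compute_subtotal(items)
    return round(subtotal + compute_tax(subtotal, tax_rate), 2)


# Behavior preservation check on a small sample input.
sample_items = [(10.0, 2), (5.0, 1)]
assert invoice_total_before(sample_items, 0.25) == invoice_total_after(sample_items, 0.25)
```

A check like the final assertion is one simple way to compare a refactoring's outcome against the original; it only demonstrates equivalence on the inputs tried, which is part of why measuring refactoring "accuracy" is hard to standardize.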


Volume 235, Article 112762

Citations: 0

Mitigation strategies for confidentiality violations in software architecture using ranked feature importance

IF 4.1, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software

Pub Date : 2025-12-26 DOI: 10.1016/j.jss.2025.112761

Nils Niehues , Sebastian Hahner , Robert Heinrich

A quality attribute like confidentiality is critical to trustworthy software but, unfortunately, very challenging to ensure. This is because modern software systems are complex and interconnected. Architecture-based confidentiality analysis enables the early detection of violations, helping to mitigate risks before deployment. However, uncertainty in software systems and their environments complicates precise and comprehensive architectural analysis. Additionally, the complexity of software models and the exponential growth of uncertainty scenarios pose significant challenges for automated mitigation, often leaving software architects to resolve confidentiality violations manually, a process that is both time-intensive and error-prone.

In this paper, we extend our machine-learning-based approach to mitigate confidentiality violations. Specifically, we introduce a novel mitigation strategy inspired by TCP Congestion Control, as well as a strategy that capitalizes on clustering techniques to dynamically adjust batch sizes. Our evaluation on three real-world software architectures demonstrates that our extended approach can mitigate confidentiality violations while outperforming the state-of-the-art. Whereas the previous upper limit was a 60-fold runtime reduction, we now achieve up to a 2298-fold reduction, with a median elevenfold reduction. Our statistical analysis confirms that the added TCP-inspired strategy is significantly cheaper than the state-of-the-art baseline (Friedman test $p = .025$ and Nemenyi post hoc test $p = .039$), while also having a strong practical impact (Kendall’s $W = 0.721$). This extended work deepens our understanding of the nature of uncertainty and of the techniques best suited to mitigating the violations caused by uncertainties. It takes us one step closer to designing more trustworthy systems.
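The abstract reports Kendall’s W as its effect size alongside the Friedman test. As a reminder of how that statistic is computed from ranks, here is a minimal sketch without tie correction. The rank data below is invented for illustration and does not come from the paper; `kendalls_w` and `friedman_chi2` are hypothetical helper names. For m blocks ranking n items, W = 12S / (m²(n³ − n)), where S is the sum of squared deviations of the per-item rank sums from their mean, and the Friedman statistic equals m(n − 1)W.

```python
from typing import Sequence


def kendalls_w(ranks: Sequence[Sequence[float]]) -> float:
    """Kendall's coefficient of concordance (no tie correction).

    Each row is one block's ranking of the same n items (1 = best).
    """
    m = len(ranks)       # number of blocks, e.g. evaluation scenarios
    n = len(ranks[0])    # number of ranked items, e.g. strategies
    if m < 2 or n < 2:
        raise ValueError("need at least 2 blocks and 2 items")
    if any(len(row) != n for row in ranks):
        raise ValueError("all rows must rank the same number of items")
    rank_sums = [sum(row[j] for row in ranks) for j in range(n)]
    mean_sum = m * (n + 1) / 2   # expected rank sum under no agreement
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def friedman_chi2(ranks: Sequence[Sequence[float]]) -> float:
    """Friedman test statistic, derived from W as m * (n - 1) * W."""
    m, n = len(ranks), len(ranks[0])
    return m * (n - 1) * kendalls_w(ranks)


# Invented example: 4 scenarios each rank 3 strategies by cost.
example = [[1, 2, 3], [1, 3, 2], [1, 2, 3], [2, 1, 3]]
print(kendalls_w(example))     # 0.5625
print(friedman_chi2(example))  # 4.5
```

W ranges from 0 (no agreement between blocks) to 1 (identical rankings in every block), which is why a value such as 0.721 is read as a strong effect.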



Citations: 0

Ethics label for digital systems to promote transparency and user awareness

IF 4.1, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software

Pub Date : 2025-12-26 DOI: 10.1016/j.jss.2025.112752

Marco Autili , Riccardo Corsi , Martina De Sanctis , Paola Inverardi , Patrizio Pelliccione

Modern digital systems pose risks to humans, society, and the environment. Guidelines, recommendations, laws, and regulations are flourishing, as are standards that try to regulate AI systems and compensate for the lack of good practices in their governance, management, and quality. However, a mark that merely certifies that a system passed some checks is not enough and is vulnerable to ethics washing. The objective of this work is to go beyond compliance with standards toward an ethics label that enables users to understand the impact of systems on human, societal, and environmental values, both during system development and usage. To build the ethics label, we analyze guidelines, recommendations, laws and regulations, and quality standards. We validate the proposed ethics label through (i) a proof-of-concept application to the social assistive robotics domain, and (ii) interviews with experts from both academia and industry. We contribute an ethics label for modern digital systems that promotes transparency and user awareness and enables users to select systems that meet their subjective ethical preferences. We discuss the problem of ethics washing and propose an ethics label as a means to foster transparency and raise user awareness.



Citations: 0

A systematic literature review of software engineering research on Jupyter notebook

IF 4.1, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software

Pub Date : 2025-12-25 DOI: 10.1016/j.jss.2025.112758

Md Saeed Siddik , Hao Li , Cor-Paul Bezemer

Context: Jupyter Notebook has emerged as a versatile tool that transforms how researchers, developers, and data scientists conduct and communicate their work. As the adoption of Jupyter notebooks continues to rise, so does the interest from the software engineering research community in improving the software engineering practices for Jupyter notebooks.

Objective: The purpose of this study is to analyze trends, gaps, and methodologies used in software engineering research on Jupyter notebooks.

Method: We selected 199 relevant publications up to September 2025, following established systematic literature review guidelines. We explored publication trends, categorized them based on software engineering topics, and reported findings based on those topics.

Results: The most popular venues for publishing software engineering research on Jupyter notebooks are related to human-computer interaction instead of traditional software engineering venues. Researchers have addressed a wide range of software engineering topics on notebooks, such as code reuse, readability, and execution environment. Although reusability is one of the research topics for Jupyter notebooks, only 82 of the 199 studies can be reused based on their provided URLs. Additionally, most replication packages are not hosted on permanent repositories for long-term availability and adherence to open science principles.

Conclusion: Solutions specific to notebooks for software engineering issues, including testing, refactoring, and documentation, are underexplored. Future research opportunities exist in automatic testing frameworks, refactoring clones between notebooks, and generating group documentation for coherent code cells.



Citations: 0