
Empirical Software Engineering: Latest Publications

Challenges and practices of deep learning model reengineering: A case study on computer vision
IF 4.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-08-20 DOI: 10.1007/s10664-024-10521-0
Wenxin Jiang, Vishnu Banna, Naveen Vivek, Abhinav Goel, Nicholas Synovic, George K. Thiruvathukal, James C. Davis

Context

Many engineering organizations are reimplementing and extending deep neural networks from the research community. We describe this process as deep learning model reengineering. Deep learning model reengineering — reusing, replicating, adapting, and enhancing state-of-the-art deep learning approaches — is challenging for reasons including under-documented reference models, changing requirements, and the cost of implementation and testing.

Objective

Prior work has characterized the challenges of deep learning model development, but as yet we know little about the deep learning model reengineering process and its common challenges. Prior work has examined DL systems from a “product” view, examining defects from projects regardless of the engineers’ purpose. Our study is focused on reengineering activities from a “process” view, and focuses on engineers specifically engaged in the reengineering process.

Method

Our goal is to understand the characteristics and challenges of deep learning model reengineering. We conducted a mixed-methods case study of this phenomenon, focusing on the context of computer vision. Our results draw from two data sources: defects reported in open-source reengineering projects, and interviews conducted with practitioners and the leaders of a reengineering team. From the defect data source, we analyzed 348 defects from 27 open-source deep learning projects. Meanwhile, our reengineering team replicated 7 deep learning models over two years; we interviewed 2 open-source contributors, 4 practitioners, and 6 reengineering team leaders to understand their experiences.

Results

Our results describe how deep learning-based computer vision techniques are reengineered, quantitatively analyze the distribution of defects in this process, and qualitatively discuss challenges and practices. We found that most defects (58%) are reported by re-users, and that reproducibility-related defects tend to be discovered during training (68% of them are found at that stage). Our analysis shows that most environment defects (88%) are interface defects, and the largest share of environment defects (46%) is caused by API defects. We found that training defects have diverse symptoms and root causes. We identified four main challenges in the DL reengineering process: model operationalization, performance debugging, portability of DL operations, and customized data pipelines. Integrating our quantitative and qualitative data, we propose a novel reengineering workflow.

Conclusions

Our findings inform several conclusions, including: standardizing model reengineering practices, developing validation tools to support model reengineering, automated support beyond manual model reengineering, and measuring additional unknown aspects of model reengineering.

Citations: 0
How does parenthood affect an ICT practitioner’s work? A survey study with fathers
IF 4.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-08-19 DOI: 10.1007/s10664-024-10534-9
Larissa Rocha, Edna Dias Canedo, Claudia Pinto Pereira, Carla Bezerra, Fabiana Freitas Mendes

Context

Many studies have investigated the perception of software development teams about gender bias, inclusion policies, and the impact of remote work on productivity. The studies indicate that mothers and fathers working in the software industry had to reconcile housework, work activities, and child care.

Objective

This study investigates the impact of parenthood on Information and Communications Technology (ICT) practitioners’ work and how fathers perceive the mothers’ challenges. Recognizing their difficulties and knowing the mothers’ challenges can be the first step towards making the work environment friendlier for everyone.

Method

We surveyed 155 fathers from industry and academia in 10 different countries; however, most of the respondents are from Brazil (92.3%). Data was analyzed quantitatively and qualitatively. We employed Grounded Theory to identify factors related to (i) paternity leave, (ii) working time after paternity, (iii) childcare-related activities, (iv) prejudice at work after paternity, (v) fathers’ perception of prejudice against mothers, and (vi) challenges and difficulties in paternity. We also conducted a correlational and regression analysis to explore some research questions further.

Results

In general, fathers do not suffer harassment or prejudice at work for being fathers. However, they perceive that mothers suffer distrust in the workplace and live with work overload because they have to dedicate themselves to many activities. They also suggested actions to mitigate parents’ difficulties.

Conclusions

Despite some fathers wanting to participate more in taking care of their children, others do not even recognize the difficulties that mothers can face at work. Therefore, it is important to explore these problems and implement actions to build a more parent-friendly work environment.

Citations: 0
Recommendations for analysing and meta-analysing small sample size software engineering experiments
IF 4.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-08-17 DOI: 10.1007/s10664-024-10504-1
Barbara Kitchenham, Lech Madeyski

Context

Software engineering (SE) experiments often have small sample sizes. This can result in data sets with non-normal characteristics, which poses problems as standard parametric meta-analysis, using the standardized mean difference (StdMD) effect size, assumes normally distributed sample data. Small sample sizes and non-normal data set characteristics can also lead to unreliable estimates of parametric effect sizes. Meta-analysis is even more complicated if experiments use complex experimental designs, such as two-group and four-group cross-over designs, which are popular in SE experiments.

Objective

Our objective was to develop a validated and robust meta-analysis method that can help to address the problems of small sample sizes and complex experimental designs without relying upon data samples being normally distributed.

Method

To illustrate the challenges, we used real SE data sets. We built upon previous research and developed a robust meta-analysis method able to deal with challenges typical for SE experiments. We validated our method via simulations comparing StdMD with two robust alternatives: the probability of superiority ($\hat{p}$) and Cliff's d.

Results

We confirmed that many SE data sets are small and that small experiments run the risk of exhibiting non-normal properties, which can cause problems for analysing families of experiments. For simulations of individual experiments and meta-analyses of families of experiments, $\hat{p}$ and Cliff's d consistently outperformed StdMD in terms of negligible small-sample bias. They also had better power for log-normal and Laplace samples, although lower power for normal and gamma samples. Tests based on $\hat{p}$ always had better or equal power than tests based on Cliff's d, and across all but one simulation condition, Type 1 error rates for $\hat{p}$ were less biased.

Conclusions

Using $\hat{p}$ is a low-risk option for analysing and meta-analysing data from small sample-size SE randomized experiments. Parametric methods are only preferable if you have prior knowledge of the data distribution.
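To make the two robust alternatives concrete, here is a minimal sketch of how the probability of superiority and Cliff's d can be estimated from two independent samples; the sample data and function names are hypothetical and this is not the authors' implementation. When estimated over all pairs, the two statistics are linked by $d = 2\hat{p} - 1$.

```python
# Illustrative estimation of the probability of superiority (p-hat) and
# Cliff's d for two independent samples (hypothetical data, not from the paper).
import numpy as np

def prob_superiority(a, b):
    """P(A > B) + 0.5 * P(A == B), estimated over all (a_i, b_j) pairs."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    greater = (a[:, None] > b[None, :]).mean()
    ties = (a[:, None] == b[None, :]).mean()
    return greater + 0.5 * ties

def cliffs_d(a, b):
    """Cliff's delta: P(A > B) - P(A < B), estimated over all pairs."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return (a[:, None] > b[None, :]).mean() - (a[:, None] < b[None, :]).mean()

# Hypothetical task-completion times from a small two-group experiment.
treatment = [12.0, 9.5, 14.2, 11.1, 10.3]
control = [13.4, 12.8, 15.0, 11.9, 14.6]
print(prob_superiority(treatment, control))  # below 0.5: treatment times tend to be lower
print(cliffs_d(treatment, control))          # equals 2 * p-hat - 1 for all-pairs estimates
```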

Citations: 0
Augmented testing to support manual GUI-based regression testing: An empirical study
IF 4.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-08-17 DOI: 10.1007/s10664-024-10522-z
Andreas Bauer, Julian Frattini, Emil Alégroth

Context

Manual graphical user interface (GUI) software testing still represents a substantial part of overall testing practice, despite various research efforts to further increase test automation. Augmented Testing (AT), a novel approach for GUI testing, aims to aid manual GUI-based testing with tool support: an intermediary visual layer is rendered between the system under test (SUT) and the tester, superimposing relevant test information.

Objective

The primary objective of this study is to gather empirical evidence regarding AT’s efficiency compared to manual GUI-based regression testing. Existing studies involving testing approaches under the AT definition primarily focus on exploratory GUI testing, leaving a gap in the context of regression testing. As a secondary objective, we investigate AT’s benefits, drawbacks, and usability issues when deployed with the demonstrator tool, Scout.

Method

We conducted an experiment involving 13 industry professionals from six companies, comparing AT to manual GUI-based regression testing. These results were complemented by interviews and a Bayesian data analysis (BDA) of the study’s quantitative results.

Results

The Bayesian data analysis revealed that the use of AT shortens test durations in 70% of the cases on average, suggesting that AT is more efficient. Comparing the mean total duration needed to perform all tests, AT reduced test duration by 36% overall. Participant interviews highlighted nine benefits and eleven drawbacks of AT, while observations revealed four usability issues.
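As a rough illustration of the kind of question such a Bayesian analysis answers (not the study's actual model or data), a Beta-Binomial sketch over hypothetical per-task outcomes estimates the proportion of tasks in which AT was faster:

```python
# Beta-Binomial sketch: posterior over the proportion of test tasks in which
# AT was faster than manual testing. The outcomes below are hypothetical.
import numpy as np
from scipy import stats

at_faster = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1])  # 1 = AT faster
# Uniform Beta(1, 1) prior; the posterior is Beta(1 + successes, 1 + failures).
posterior = stats.beta(1 + at_faster.sum(), 1 + (1 - at_faster).sum())
print(f"posterior mean: {posterior.mean():.2f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
print(f"P(AT faster in most tasks): {1 - posterior.cdf(0.5):.2f}")
```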

Conclusion

This study presents empirical evidence of improved efficiency using AT in the context of manual GUI-based regression testing. We further report AT’s benefits, drawbacks, and usability issues. The majority of identified usability issues and drawbacks can be attributed to the tool implementation of AT and, thus, can serve as valuable input for future tool development.

Citations: 0
The best ends by the best means: ethical concerns in app reviews
IF 4.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-08-17 DOI: 10.1007/s10664-024-10463-7
Neelam Tjikhoeri, Lauren Olson, Emitzá Guzmán

This work analyzes ethical concerns found in users’ app store reviews. We performed this study because ethical concerns in mobile applications (apps) are widespread, pose severe threats to end users and society, and lack systematic analysis and methods for detection and classification. In addition, app store reviews allow practitioners to collect users’ perspectives, crucial for identifying software flaws, from a geographically distributed and large-scale audience. For our analysis, we collected five million user reviews, developed a set of ethical concerns representative of user preferences, and manually labeled a sample of these reviews. We found that (1) users most frequently report ethical concerns about censorship, identity theft, and safety, (2) user reviews with ethical concerns are longer, more popular, and rated lower, and (3) there is high automation potential for the classification and filtering of these reviews. Our results highlight the relevance of using app store reviews for the systematic consideration of ethical concerns during software evolution.
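As a toy illustration of the automation potential mentioned in point (3), and not the authors' pipeline, a simple bag-of-words classifier can flag reviews that raise ethical concerns; the labelled examples and the binary label scheme below are hypothetical.

```python
# Toy filter for reviews that raise ethical concerns (hypothetical labels).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "the app blocked my post for no reason this is censorship",
    "someone stole my identity after signing up here",
    "my personal data was shared without my consent",
    "love the new dark mode great update",
    "crashes every time i open the camera",
    "five stars fast and easy to use",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = raises an ethical concern

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, labels)
# Returns a 0/1 flag for each new review; real use would need far more data.
print(clf.predict(["they banned my account and deleted my review unfairly"]))
```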

Citations: 0
Static analysis driven enhancements for comprehension in machine learning notebooks
IF 4.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-08-12 DOI: 10.1007/s10664-024-10525-w
Ashwin Prasad Shivarpatna Venkatesh, Samkutty Sabu, Mouli Chekkapalli, Jiawei Wang, Li Li, Eric Bodden

Jupyter notebooks have emerged as the predominant tool for data scientists to develop and share machine learning solutions, primarily using Python as the programming language. Despite their widespread adoption, a significant fraction of these notebooks, when shared on public repositories, suffer from insufficient documentation and a lack of coherent narrative. Such shortcomings compromise the readability and understandability of the notebook. Addressing this shortcoming, this paper introduces HeaderGen, a tool-based approach that automatically augments code cells in these notebooks with descriptive markdown headers, derived from a predefined taxonomy of machine learning operations. Additionally, it systematically classifies and displays function calls in line with this taxonomy. The mechanism that powers HeaderGen is an enhanced call graph analysis technique, building upon the foundational analysis available in PyCG. To improve precision, HeaderGen extends PyCG’s analysis with return-type resolution of external function calls, type inference, and flow-sensitivity. Furthermore, leveraging type information, HeaderGen employs pattern matching techniques on the code syntax to annotate code cells. We conducted an empirical evaluation on 15 real-world Jupyter notebooks sourced from Kaggle. The results indicate a high accuracy in call graph analysis, with precision at 95.6% and recall at 95.3%. The header generation has a precision of 85.7% and a recall rate of 92.8% with regard to headers created manually by experts. A user study corroborated the practical utility of HeaderGen, revealing that users found HeaderGen useful in tasks related to comprehension and navigation. To further evaluate the type inference capability of static analysis tools, we introduce TypeEvalPy, a framework for evaluating type inference tools for Python with an in-built micro-benchmark containing 154 code snippets and 845 type annotations in the ground truth. Our comparative analysis on four tools revealed that HeaderGen outperforms other tools in exact matches with the ground truth.
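HeaderGen's actual analysis is a flow-sensitive call graph built on PyCG with type inference; the sketch below only illustrates the much simpler, purely syntactic end of the idea: extracting called names from a notebook code cell and mapping them onto a toy taxonomy of machine learning operations. The taxonomy, helper names, and example cell are hypothetical.

```python
# Purely syntactic sketch: extract call names from a notebook code cell and
# map them to a toy taxonomy of ML operations (not HeaderGen's real analysis).
import ast

# Hypothetical taxonomy: called-name suffix -> markdown header label.
TAXONOMY = {
    "read_csv": "Data Loading",
    "fillna": "Data Cleaning",
    "train_test_split": "Data Preparation",
    "fit": "Model Training",
    "predict": "Prediction",
}

def call_names(cell_source):
    """Return dotted names of all calls appearing in a code cell."""
    names = []
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Call):
            parts, func = [], node.func
            while isinstance(func, ast.Attribute):
                parts.append(func.attr)
                func = func.value
            if isinstance(func, ast.Name):
                parts.append(func.id)
            names.append(".".join(reversed(parts)))
    return names

def suggest_header(cell_source):
    """Pick a header for the cell from the first call that matches the taxonomy."""
    for name in call_names(cell_source):
        label = TAXONOMY.get(name.split(".")[-1])
        if label:
            return "## " + label
    return "## Other"

cell = "df = pd.read_csv('train.csv')\ndf = df.fillna(0)"
print(suggest_header(cell))  # -> "## Data Loading"
```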

Citations: 0
Causal inference of server- and client-side code smells in web apps evolution
IF 4.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-08-05 DOI: 10.1007/s10664-024-10478-0
Américo Rio, Fernando Brito e Abreu, Diana Mendes

Context

Code smells (CS) are symptoms of poor design and implementation choices that may lead to increased defect incidence, decreased code comprehension, and longer times to release. Web applications and systems are seldom studied, probably due to the heterogeneity of platforms (server- and client-side) and languages; to study web code smells, we need to consider CS covering that diversity. Furthermore, the literature provides little evidence for the claim that CS are a symptom of poor design that leads to future problems in web apps.

Objective

To study the quantitative evolution and inner relationship of CS in web apps on the server- and client-sides, and their impact on maintainability and app time-to-release (TTR).

Method

We collected and analyzed 18 server-side and 12 client-side code smells, aka web smells, from consecutive official releases of 12 typical PHP web apps, i.e., apps with server and client code in the same code base, totalling 811 releases. Additionally, we collected metrics, maintenance issues, reported bugs, and release dates. We used several methodologies to devise causality relationships among the considered irregular time series, such as Granger causality and Information Transfer Entropy (TE), with CS from the previous one to four releases (lags 1 to 4).
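As an illustration of the Granger-causality part of this setup (transfer entropy is not shown), the sketch below tests whether a hypothetical release-level code-smell density series helps predict a bug-count series at lags 1 to 4, using statsmodels; the series are simulated, not the study's data.

```python
# Granger-causality sketch on two hypothetical release-level series:
# does smell density help predict bug counts at lags 1..4?
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
smell_density = rng.normal(size=40).cumsum()
# Bug series partially driven by the smell series one release earlier (plus noise).
bugs = 0.6 * np.roll(smell_density, 1) + rng.normal(scale=0.5, size=40)

# Drop the first release, where np.roll wrapped around.
data = pd.DataFrame({"bugs": bugs[1:], "smell_density": smell_density[1:]})
# Null hypothesis per lag: "smell_density does not Granger-cause bugs".
results = grangercausalitytests(data[["bugs", "smell_density"]], maxlag=4)
for lag, res in results.items():
    print(f"lag {lag}: ssr F-test p-value = {res[0]['ssr_ftest'][1]:.3f}")
```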

Results

The CS typically evolve the same way inside their group, and it is possible to analyze them as groups. The CS group trends are: server, slowly decreasing; client-side embed, decreasing; and JavaScript, increasing. Studying the relationship between CS groups, we found that the "lack of code quality", measured with CS density proxies, propagates from client code to server code and JavaScript in half of the applications. We found causality relationships between CS and issues. We also found causality from CS groups to bugs at lag 1, decreasing in the subsequent lags: the values are 15% (lag 1) and 10% (lag 2), and then they decrease. Client-side embed CS from up to 3 releases earlier still have an impact. In the group analysis, server-side CS and JavaScript contribute more to bugs. There are causality relationships from individual CS to TTR at lag 1, decreasing at lag 2, and from all CS groups to TTR at lag 1, decreasing in the other lags, except for client CS.

Conclusions

There is statistical inference between CS groups. There is also evidence of statistical inference from the CS to web applications’ issues, bugs, and TTR. Client- and server-side CS contribute globally to the quality of web applications; this contribution is low, but significant. Depending on the outcome variable (issues, bugs, time-to-release), the contribution from CS is between 10% and 20%.

Citations: 0
On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools
IF 4.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-08-03 DOI: 10.1007/s10664-024-10506-z
Aurora Papotti, Ranindya Paramitha, Fabio Massacci

Objective

We investigated whether (possibly wrong) security patches suggested by Automated Program Repair (APR) tools for real-world projects are recognized by human reviewers. We also investigated whether knowing that a patch was produced by an allegedly specialized tool changes the decision of human reviewers.

Method

We performed an experiment with $n = 72$ Master's students in Computer Science. In the first phase, using a balanced design, we presented human reviewers with a combination of patches proposed by APR tools for different vulnerabilities and asked the reviewers to adopt or reject the proposed patches. In the second phase, we told participants that some of the proposed patches were generated by security-specialized tools (even if the tool was actually a ‘normal’ APR tool) and measured whether the human reviewers would change their decision to adopt or reject a patch.

Results

It is easier to identify wrong patches than correct patches, and correct patches are not confused with partially correct patches. Patches from security-specialized APR tools are also adopted more often than patches suggested by generic APR tools, but there is not enough evidence to verify whether ‘bogus’ security claims are distinguishable from ‘true security’ claims. Finally, the number of switches to the patches suggested by the security tool is significantly higher after the security information is revealed, irrespective of correctness.

Limitations

The experiment was conducted in an academic setting, and focused on a limited sample of popular APR tools and popular vulnerability types.

Citations: 0
IRJIT: A simple, online, information retrieval approach for just-in-time software defect prediction
IF 4.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-08-02 DOI: 10.1007/s10664-024-10514-z
Hareem Sahar, Abdul Ali Bangash, Abram Hindle, Denilson Barbosa

Just-in-Time software defect prediction (JIT-SDP) prevents the introduction of defects into the software by identifying them at commit check-in time. Current software defect prediction approaches rely on manually crafted features such as change metrics and involve expensive-to-train machine learning or deep learning models. These models typically involve extensive training processes that may require significant computational resources and time. These characteristics can pose challenges when attempting to update the models in real time as new examples become available, potentially impacting their suitability for fast online defect prediction. Furthermore, the reliance on a complex underlying model often makes these approaches less explainable, which means the developers cannot understand the reasons behind the models’ predictions. An approach that is not explainable might not be adopted in real-life development environments because of developers’ lack of trust in its results. To address these limitations, we propose an approach called IRJIT that employs information retrieval on source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits. The IRJIT approach is online and explainable, as it can learn from new data without expensive retraining, and developers can see the documents that support a prediction, providing additional context. By evaluating 10 open-source datasets in a within-project setting, we show that our approach is up to 112 times faster than the state-of-the-art ML and DL approaches, offers explainability at the commit and line level, and has comparable performance to the state-of-the-art.
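The general retrieval idea can be sketched in a few lines: index past commits as bags of code tokens, retrieve the most similar ones for a new commit, and let their known labels vote. This is only a schematic illustration under assumed data; it uses TF-IDF and cosine similarity rather than IRJIT's actual retrieval backend, and the commits and labels are hypothetical.

```python
# Schematic similarity-based just-in-time labelling (not the IRJIT implementation).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_commits = [                      # token streams of past changes (hypothetical)
    "if buffer is None raise ValueError close file handle",
    "add unit test for parser edge case empty input",
    "fix off by one error in loop index range len items",
    "update readme documentation usage section",
]
past_labels = ["clean", "clean", "buggy", "clean"]

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(past_commits)

def predict(new_commit, k=3):
    """Label a new commit by a similarity-weighted vote over its k nearest neighbours."""
    scores = cosine_similarity(vectorizer.transform([new_commit]), index).ravel()
    top_k = np.argsort(scores)[::-1][:k]
    weights = {}
    for i in top_k:
        weights[past_labels[i]] = weights.get(past_labels[i], 0.0) + scores[i]
    # The retrieved commits double as the 'documents supporting the prediction'.
    return max(weights, key=weights.get)

print(predict("fix index error in loop over items"))  # -> "buggy"
```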

Citations: 0
Industrial adoption of machine learning techniques for early identification of invalid bug reports
IF 4.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-07-31 DOI: 10.1007/s10664-024-10502-3
Muhammad Laiq, Nauman bin Ali, Jürgen Börstler, Emelie Engström

Despite the accuracy of machine learning (ML) techniques in predicting invalid bug reports, as shown in earlier research, and the importance of early identification of invalid bug reports in software maintenance, the adoption of ML techniques for this task in industrial practice is yet to be investigated. In this study, we used a technology transfer model to guide the adoption of an ML technique at a company for the early identification of invalid bug reports. In the process, we also identify necessary conditions for adopting such techniques in practice. We followed a case study research approach with various design and analysis iterations for technology transfer activities. We collected data from bug repositories, through focus groups, a questionnaire, and a presentation and feedback session with an expert. As expected, we found that an ML technique can identify invalid bug reports with acceptable accuracy at an early stage. However, the technique’s accuracy drops over time in operational use due to changes in the product, the technologies used, or the development organization. Such changes may require retraining the ML model. During validation, practitioners highlighted the need to understand the ML technique’s predictions in order to trust them. We found that a visual (using a state-of-the-art ML interpretation framework) and descriptive explanation of the prediction increases the trustability of the technique compared to just presenting the results of the validity predictions. We conclude that trustability, integration with the existing toolchain, and maintaining the technique’s accuracy over time are critical for increasing the likelihood of adoption.
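To make the idea of a descriptive explanation concrete, the sketch below trains a linear classifier over TF-IDF features on a handful of hypothetical bug reports and lists the tokens that pushed a given report towards "invalid"; the data, labels, and feature attribution are illustrative only and do not reproduce the company setting or the interpretation framework used in the study.

```python
# Linear classifier with a simple descriptive explanation (hypothetical data).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reports = [
    "app crashes with null pointer exception when saving settings",
    "cannot reproduce steps unclear no logs attached",
    "feature request please add dark mode",
    "login fails with error 500 after latest update",
    "duplicate of an already reported issue works as intended",
    "memory leak when uploading large files stack trace attached",
]
labels = [0, 1, 1, 0, 1, 0]  # 1 = invalid bug report, 0 = valid

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(reports), labels)

new_report = "works as intended cannot reproduce on my device"
x = vec.transform([new_report])
contrib = x.toarray().ravel() * clf.coef_.ravel()     # per-token contribution
top = np.argsort(contrib)[::-1][:3]
terms = np.array(vec.get_feature_names_out())
print("predicted invalid:", bool(clf.predict(x)[0]))
print("tokens pushing towards 'invalid':", terms[top][contrib[top] > 0].tolist())
```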

Citations: 0