
Empirical Software Engineering: Latest Publications

An empirical study of untangling patterns of two-class dependency cycles
IF 4.1, CAS Tier 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-03-12. DOI: 10.1007/s10664-023-10438-0
Qiong Feng, Shuwen Liu, Huan Ji, Xiaotian Ma, Peng Liang

Dependency cycles pose a significant challenge to software quality and maintainability. However, there is limited understanding of how practitioners resolve dependency cycles in real-world scenarios. This paper presents an empirical study investigating the recurring patterns employed by software developers to resolve dependency cycles between two classes in practice. We analyzed data from 38 open-source projects across different domains and manually inspected hundreds of cycle untangling cases. Our findings reveal that developers tend to employ five recurring patterns to address dependency cycles. The chosen patterns are determined not only by the dependency relations between the cyclic classes, but also by their design context, i.e., how the cyclic classes depend on or are depended upon by their neighboring classes. Through this empirical study, we also discovered three counterintuitive solutions that developers commonly adopt when handling cycles. These recurring patterns and counterintuitive solutions observed in practice can serve as a taxonomy to raise developers' awareness, and can be used as learning material for software engineering students and inexperienced developers. Our results also suggest that, in addition to considering the internal structure of dependency cycles, automatic tools need to consider the design context of cycles to provide better support for refactoring them.
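
The abstract does not name the five untangling patterns; as a minimal illustration of the problem and of one widely known remedy (dependency inversion via an abstraction), consider this hypothetical Python sketch, where all class names are invented for the example:

```python
# A two-class dependency cycle: Order creates Invoice, and Invoice calls
# back into Order, so neither class can be understood or tested alone.
class Order:
    def __init__(self):
        self.invoice = Invoice(self)

    def total(self) -> float:
        return 100.0


class Invoice:
    def __init__(self, order: "Order"):
        self.order = order  # depends back on Order -> cycle

    def amount_due(self) -> float:
        return self.order.total()


# One common untangling move: invert the dependency by introducing an
# abstraction that Invoice depends on instead of the concrete Order.
from abc import ABC, abstractmethod

class Billable(ABC):
    @abstractmethod
    def total(self) -> float: ...

class DecoupledInvoice:
    def __init__(self, billable: Billable):
        self.billable = billable  # Invoice -> Billable only; no cycle

    def amount_due(self) -> float:
        return self.billable.total()
```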

Citations: 0
Machine learning-based test smell detection
IF 4.1, CAS Tier 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-03-05. DOI: 10.1007/s10664-023-10436-2
Valeria Pontillo, Dario Amoroso d’Aragona, Fabiano Pecorelli, Dario Di Nucci, Filomena Ferrucci, Fabio Palomba

Test smells are symptoms of sub-optimal design choices adopted when developing test cases. Previous studies have proved their harmfulness for test code maintainability and effectiveness. Therefore, researchers have been proposing automated, heuristic-based techniques to detect them. However, the performance of these detectors is still limited and dependent on tunable thresholds. We design and experiment with a novel test smell detection approach based on machine learning to detect four test smells. First, we develop the largest dataset of manually-validated test smells to enable experimentation. Afterward, we train six machine learners and assess their capabilities in within- and cross-project scenarios. Finally, we compare the ML-based approach with state-of-the-art heuristic-based techniques. The key findings of the study report a negative result. The performance of the machine learning-based detector is significantly better than heuristic-based techniques, but none of the learners able to overcome an average F-Measure of 51%. We further elaborate and discuss the reasons behind this negative result through a qualitative investigation into the current issues and challenges that prevent the appropriate detection of test smells, which allowed us to catalog the next steps that the research community may pursue to improve test smell detection techniques.
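
As a sketch of how such a learner-based detector can be set up with scikit-learn — the structural features below (assertion counts, sleep calls) are hypothetical, since the paper's actual feature set is not given in the abstract:

```python
# Minimal sketch of a machine-learning test smell detector.
# Requires scikit-learn and pandas; features and labels are toy data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Each row describes one test method; "smelly" is the manually
# validated label (1 = exhibits the target test smell).
data = pd.DataFrame({
    "assertion_count": [1, 9, 2, 14, 3, 11],
    "lines_of_code":   [8, 60, 12, 75, 15, 52],
    "has_sleep_call":  [0, 1, 0, 1, 0, 1],
    "smelly":          [0, 1, 0, 1, 0, 1],
})

X, y = data.drop(columns="smelly"), data["smelly"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```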

Citations: 0
Investigating the readability of test code
IF 4.1, CAS Tier 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-02-26. DOI: 10.1007/s10664-023-10390-z

Abstract

Context

The readability of source code is key for understanding and maintaining software systems and tests. Although several studies investigate the readability of source code, there is limited research specifically on the readability of test code and related influence factors.

Objective

In this paper, we aim to investigate the factors that influence the readability of test code from an academic perspective, based on scientific literature sources and complemented by practical views as discussed in grey literature.

Methods

First, we perform a Systematic Mapping Study (SMS) with a focus on scientific literature. Second, we extend this study by reviewing grey literature sources for practical perspectives on test code readability and understandability. Finally, we conduct a controlled experiment on the readability of a selected set of test cases to collect additional knowledge on influence factors discussed in practice.

Results

The result set of the SMS includes 19 primary studies from the scientific literature for further analysis. The grey literature search reveals 62 sources of information on test code readability. Based on an analysis of these sources, we identified a combined set of 14 factors that influence the readability of test code. Seven of these factors were found in both scientific and grey literature, while some factors were mainly discussed in academia (2) or industry (5) with only limited overlap. The controlled experiment on practically relevant influence factors showed that the investigated factors have a significant impact on readability for half of the selected test cases.

Conclusion

Our review of scientific and grey literature showed that test code readability is of interest for academia and industry with a consensus on key influence factors. However, we also found factors only discussed by practitioners. For some of these factors we were able to confirm an impact on readability in a first experiment. Therefore, we see the need to bring together academic and industry viewpoints to achieve a common view on the readability of software test code.
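
The 14 influence factors are not enumerated in the abstract; the following sketch illustrates the kind of factors commonly discussed in the readability literature (intention-revealing names, magic values, one step per line) by contrasting two versions of the same unittest test — the Cart class is invented for the example:

```python
import unittest

class Cart:
    def __init__(self):
        self.items = []

    def add(self, price, quantity):
        self.items.append(price * quantity)

    def total(self):
        return sum(self.items)

class CartTest(unittest.TestCase):
    # Harder to read: opaque name, magic numbers, everything on one line.
    def test1(self):
        c = Cart(); c.add(3, 2); self.assertEqual(c.total(), 6)

    # Easier to read: intention-revealing name, named arguments,
    # one arrange/act/assert step per line.
    def test_total_is_price_times_quantity(self):
        cart = Cart()
        cart.add(price=3, quantity=2)
        self.assertEqual(cart.total(), 6)
```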

Citations: 0
When less is more: on the value of "co-training" for semi-supervised software defect predictors
IF 4.1, CAS Tier 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-02-24. DOI: 10.1007/s10664-023-10418-4
Suvodeep Majumder, Joymallya Chakraborty, Tim Menzies

Labeling a module defective or non-defective is an expensive task; hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects — and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those obtained using 100% of the data. That said, co-training needs to be used cautiously, since the specific co-training method must be selected carefully based on a user's goals. Also, we warn that a commonly used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions while adding substantially to run time (11 hours vs. 1.8 hours). It is an open question, worthy of future work, whether these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used is available at https://github.com/ai-se/Semi-Supervised.
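
A minimal sketch of one co-training loop, in which two different learners pseudo-label their most confident unlabeled modules for each other. This is illustrative only — the paper evaluates many co-training variants, and the sketch deliberately keeps a single shared feature set rather than the multi-view column split the study warns against:

```python
# Co-training sketch: two learners alternately promote their most
# confident predictions on unlabeled data into the labeled pool.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def co_train(X, y_partial, rounds=5, per_round=10):
    """X: feature matrix; y_partial: labels with -1 marking unlabeled rows."""
    y = y_partial.copy()
    learners = [RandomForestClassifier(random_state=0),
                LogisticRegression(max_iter=1000)]
    for _ in range(rounds):
        if np.flatnonzero(y == -1).size == 0:
            break
        for teacher in learners:
            labeled = np.flatnonzero(y != -1)
            unlabeled = np.flatnonzero(y == -1)
            if unlabeled.size == 0:
                break
            teacher.fit(X[labeled], y[labeled])
            # Promote the examples this teacher is most confident about.
            proba = teacher.predict_proba(X[unlabeled])
            take = unlabeled[np.argsort(proba.max(axis=1))[-per_round:]]
            y[take] = teacher.predict(X[take])
    return y
```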

Citations: 0
Traceability and reuse mechanisms, the most important properties of model transformation languages
IF 4.1, CAS Tier 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-02-24. DOI: 10.1007/s10664-023-10428-2
Stefan Höppner, Matthias Tichy

Context

Dedicated model transformation languages are claimed to provide many benefits over general purpose languages for developing model transformations. However, the actual advantages and disadvantages associated with the use of model transformation languages are poorly understood empirically. There is little knowledge, and even less empirical assessment, about which advantages and disadvantages hold in which cases and where they originate from. In a prior interview study, we elicited expert opinions on which factors surrounding model transformation languages produce which advantages, as well as on a number of moderating factors that shape this influence.

Objective

We aim to quantitatively assess the interview results to confirm or reject the influences and moderation effects attributed to different factors. We further intend to gain insights into how valuable different factors are to the discussion, so that future studies can draw on these data when designing targeted and relevant studies.

Method

We gather data on the factors and quality attributes using an online survey. To analyse the data and examine the hypothesised influences and moderations, we use universal structure modelling based on a structural equation model. Universal structure modelling produces significance values and path coefficients for each hypothesised and modelled interdependence between factors and quality attributes that can be used to confirm or reject correlation and to weigh the strength of influence present.

Results

We analyzed 113 responses. The results show that the MTL capabilities Tracing and Reuse Mechanisms are the most important overall, though the observed effects were generally 10 times weaker than anticipated. Furthermore, we found that moderation effects need to be assessed individually for each influence on a quality attribute. The moderation effects of a single moderating variable vary significantly across influences, with the strongest effects being 1000 times larger than the weakest.

Conclusion

The empirical assessment of MTLs is a complex topic that cannot be resolved by looking at a single stand-alone factor. Our results provide a clear indication that evaluation should consider transformations of different sizes and use cases that go beyond mapping one element's attributes to another's. Language development, on the other hand, should focus on providing practical, transformation-specific reuse mechanisms that allow MTLs to excel in areas such as maintainability and productivity compared to GPLs.
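
In structural equation models of this kind, a moderation effect is conventionally expressed as an interaction term whose coefficient scales the focal path. The abstract does not give the authors' exact USM formulation; a standard moderated-path specification looks like:

```latex
% y: quality attribute; x: MTL factor; m: moderating variable.
% \beta_1 is the direct path coefficient; \beta_3 captures how the
% moderator m strengthens or weakens the influence of x on y.
y = \beta_0 + \beta_1 x + \beta_2 m + \beta_3 (x \cdot m) + \varepsilon
```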

Citations: 0
An empirical study of attack-related events in DeFi projects development
IF 4.1, CAS Tier 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-02-23. DOI: 10.1007/s10664-024-10447-7
Dongming Xiang, Yuanchang Lin, Liming Nie, Yaowen Zheng, Zhengzi Xu, Zuohua Ding, Yang Liu

Decentralized Finance (DeFi) offers users decentralized financial services that are tied to the security of their assets. If DeFi is attacked, it can lead to considerable losses. Unfortunately, there is a lack of research on how DeFi developers respond to attacks during the development process. This lack of knowledge makes it difficult to identify which attacks to protect against and to create a comprehensive attack response system. This paper presents an empirical study of the current state of developers' responses to attacks during the development process. In addition, we construct an analytical framework to help developers take preventive measures against attacks. Our research reveals that Overflow Attack-related events are the most frequent (63, or 19.75% of all attack-related events), and that high-value DeFi projects tend to have more feedback and more active development activity. We observed that most attack instances (61, or 85.92%) have no corresponding attack-related development events, which can lead to a lack of trust between project teams and users when it is unclear whether the team responds to attacks. Furthermore, we noticed that after the resolution of an attack-related event, some attacks may recur even though they could have been prevented. Consequently, we suggest some future research directions and provide advice for DeFi project developers.
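
Overflow attacks exploit fixed-width integer arithmetic in smart contracts (e.g., uint256 in Solidity). As a hedged, language-agnostic illustration — the paper does not prescribe a fix — this Python sketch emulates the wrap-around and the SafeMath-style guard against it:

```python
# Python ints never overflow, so we emulate uint256 arithmetic to show
# the guard that overflow attacks exploit when it is absent.
UINT256_MAX = 2**256 - 1

def checked_add(a: int, b: int) -> int:
    """SafeMath-style addition: raise instead of wrapping around."""
    s = a + b
    if s > UINT256_MAX:
        raise OverflowError("uint256 addition overflow")
    return s

def unchecked_add(a: int, b: int) -> int:
    """Wrapping addition, as in pre-0.8 Solidity without SafeMath."""
    return (a + b) & UINT256_MAX

# An attacker-controlled balance update that wraps to a tiny value:
print(unchecked_add(UINT256_MAX, 1))  # 0 -- the wrap-around an attacker wants
print(checked_add(1, 2))              # 3
```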

Citations: 0
LineFlowDP: A Deep Learning-Based Two-Phase Approach for Line-Level Defect Prediction
IF 4.1, CAS Tier 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-02-23. DOI: 10.1007/s10664-023-10439-z
Fengyu Yang, Fa Zhong, Guangdong Zeng, Peng Xiao, Wei Zheng

Software defect prediction plays a key role in guiding resource allocation for software testing. However, previous defect prediction studies still have some limitations: (1) the granularity of defect prediction is still coarse, so high-risk code statements cannot be accurately located; (2) in fine-grained defect prediction, the semantic and structural information available in a single line of code is limited and is not sufficient to differentiate lines semantically. To address these problems, we propose a two-phase, line-level defect prediction method based on deep learning called LineFlowDP. We first extract the program dependency graph (PDG) of the source files. The lines of code corresponding to the nodes in the PDG are extended semantically with data flow and control flow information and embedded as nodes, and the model is then trained using a relational graph convolutional network. Finally, the graph interpreter GNNExplainer and a social network analysis method are used to rank the lines of code in a defective file by risk. On 32 datasets from 9 projects, the experimental results show that LineFlowDP is 13%-404% more cost-effective than four state-of-the-art line-level defect prediction methods. The effectiveness of the flow-information extension and the code line risk ranking methods was also verified via ablation experiments.
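
A minimal sketch of the relational-GCN step using PyTorch Geometric's RGCNConv, with placeholder line embeddings and two edge relations (data flow, control flow); LineFlowDP's exact architecture, embedding method, and hyperparameters are not specified in the abstract:

```python
# Relational GCN over a toy program dependency graph.
# Requires torch and torch_geometric.
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGCNConv

NUM_RELATIONS = 2  # relation 0 = data flow, relation 1 = control flow

class LineRGCN(torch.nn.Module):
    def __init__(self, in_dim=128, hidden=64):
        super().__init__()
        self.conv1 = RGCNConv(in_dim, hidden, NUM_RELATIONS)
        self.conv2 = RGCNConv(hidden, 2, NUM_RELATIONS)  # defect / clean

    def forward(self, x, edge_index, edge_type):
        h = F.relu(self.conv1(x, edge_index, edge_type))
        return self.conv2(h, edge_index, edge_type)

# Toy PDG: 4 code lines, one data-flow edge and two control-flow edges.
x = torch.randn(4, 128)                        # per-line embeddings
edge_index = torch.tensor([[0, 1, 1], [1, 2, 3]])
edge_type = torch.tensor([0, 1, 1])
logits = LineRGCN()(x, edge_index, edge_type)  # per-line defect logits
```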

Citations: 0
Analyzing source code vulnerabilities in the D2A dataset with ML ensembles and C-BERT
IF 4.1, CAS Tier 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-02-22. DOI: 10.1007/s10664-023-10405-9
Saurabh Pujar, Yunhui Zheng, Luca Buratti, Burn Lewis, Yunchung Chen, Jim Laredo, Alessandro Morari, Edward Epstein, Tsungnan Lin, Bo Yang, Zhong Su

Static analysis tools are widely used for vulnerability detection as they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of machine learning models to learn from programming language data opens new possibilities for reducing false positives when applied to static analysis. However, existing datasets for training vulnerability identification models suffer from multiple limitations such as limited bug context, limited size, and synthetic, unrealistic source code. We propose Differential Dataset Analysis, or D2A, a differential analysis based approach to label issues reported by static analysis tools; the dataset built with this approach is called the D2A dataset. The D2A dataset is built by analyzing version pairs from multiple open source projects. From each project, we select bug fixing commits and run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that were fixed by the commit. We use D2A to generate a large labeled dataset, and then train both classic machine learning models and deep learning models for vulnerability identification using it. We show that the dataset can be used to build a classifier to identify likely false alarms among the issues reported by static analysis, helping developers prioritize and investigate potential true positives first. To facilitate future research and contribute to the community, we make the dataset generation pipeline and the dataset publicly available. We have also created a leaderboard based on the D2A dataset, which has already attracted attention and participation from the community.
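
The differential-labeling idea can be sketched as a set difference over issue fingerprints. The fingerprint format below is hypothetical; the real D2A pipeline matches static-analyzer reports across versions more carefully:

```python
# D2A-style differential labeling: issues reported before a bug-fixing
# commit that disappear after it are treated as likely true positives.
def label_issues(before_issues, after_issues):
    """Each issue is identified by a stable fingerprint, e.g.
    (bug_type, file, function)."""
    after = set(after_issues)
    labels = {}
    for issue in before_issues:
        # Fixed by the commit -> likely a real bug; survived -> likely noise.
        labels[issue] = ("likely_true_positive" if issue not in after
                         else "likely_false_positive")
    return labels

before = {("BUFFER_OVERRUN", "util.c", "copy"), ("NULL_DEREF", "io.c", "read")}
after = {("NULL_DEREF", "io.c", "read")}
print(label_issues(before, after))
```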

Citations: 0
Evaluating the impact of flaky simulators on testing autonomous driving systems
IF 4.1, CAS Tier 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-02-21. DOI: 10.1007/s10664-023-10433-5
Mohammad Hossein Amini, Shervin Naseri, Shiva Nejati

Simulators are widely used to test Autonomous Driving Systems (ADS), but their potential flakiness can lead to inconsistent test results. We investigate test flakiness in simulation-based testing of ADS by addressing two key questions: (1) How do flaky ADS simulations impact automated testing that relies on randomized algorithms? and (2) Can machine learning (ML) effectively identify flaky ADS tests while decreasing the required number of test reruns? Our empirical results, obtained from two widely-used open-source ADS simulators and five diverse ADS test setups, show that test flakiness in ADS is a common occurrence and can significantly impact the test results obtained by randomized algorithms. Further, our ML classifiers effectively identify flaky ADS tests using only a single test run, achieving F1-scores of 85%, 82% and 96% for three different ADS test setups. Our classifiers significantly outperform our non-ML baseline, which requires executing tests at least twice, by 31%, 21%, and 13% in F1-score performance, respectively. We conclude with a discussion on the scope, implications and limitations of our study. We provide our complete replication package in a Github repository (Github paper 2023).
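
A sketch contrasting the non-ML rerun baseline (execute at least twice, flag disagreement) with a single-run classifier over simulation metrics; the feature names below are hypothetical, as the paper's feature set is not given in the abstract:

```python
# Rerun baseline vs. single-run flakiness classifier (illustrative).
from sklearn.ensemble import RandomForestClassifier

def rerun_baseline(run_test, n=2):
    """Non-ML baseline: a test is flaky if repeated runs disagree."""
    outcomes = {run_test() for _ in range(n)}
    return len(outcomes) > 1

print(rerun_baseline(lambda: "pass"))  # False: identical outcomes

# Single-run alternative: predict flakiness from one simulation's metrics.
# Columns (hypothetical): min frame rate, physics step time (ms), dropped frames.
X = [[0.92, 31.0, 140], [0.41, 33.5, 9], [0.88, 30.9, 151], [0.39, 35.1, 7]]
y = [0, 1, 0, 1]  # 1 = flaky, labeled from historical reruns
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict([[0.45, 34.0, 12]]))  # flag likely-flaky without rerunning
```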

Citations: 0
Studying the impact of risk assessment analytics on risk awareness and code review performance
IF 4.1, CAS Tier 2 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-02-17. DOI: 10.1007/s10664-024-10443-x

Abstract

While code review is a critical component of modern software quality assurance, defects can still slip through the review process undetected. Previous research suggests that the main reason for this is a lack of reviewer awareness about the likelihood of defects in proposed changes; even experienced developers may struggle to evaluate the potential risks. If a change's riskiness is underestimated, it may not receive adequate attention during review, potentially leading to defects being introduced into the codebase. In this paper, we investigate how risk assessment analytics can influence the level of awareness among developers regarding the potential risks associated with code changes; we also study how effective and efficient reviewers are at detecting defects during code review with the use of such analytics. We conduct a controlled experiment using Gherald, a risk assessment prototype tool that analyzes the riskiness of change sets based on historical data. Following a between-subjects experimental design, we assign participants to the treatment (i.e., with access to Gherald) or control group. All participants are asked to perform risk assessment and code review tasks. Through our experiment with 48 participants, we find that the use of Gherald is associated with statistically significant improvements (one-tailed, unpaired Mann-Whitney U test, α = 0.05) in developer awareness of the riskiness of code changes and in code review effectiveness. Moreover, participants in the treatment group tend to identify the known defects more quickly than those in the control group; however, the difference between the two groups is not statistically significant. Our results lead us to conclude that the adoption of a risk assessment tool has a positive impact on code review practices, which provides valuable insights for practitioners seeking to enhance their code review process and highlights the importance of further research to explore more effective and practical risk assessment approaches.
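
A sketch of change-level risk scoring in the spirit of just-in-time defect prediction; Gherald's actual model, features, and weights are not described in the abstract, so everything below is illustrative:

```python
# Toy logistic risk score over common change-risk signals
# (weights are illustrative, not fitted to any dataset).
import math

def risk_score(lines_added, files_touched, author_prior_bugs,
               file_prior_bugs):
    z = (0.02 * lines_added + 0.3 * files_touched
         + 0.5 * author_prior_bugs + 0.8 * file_prior_bugs - 3.0)
    return 1.0 / (1.0 + math.exp(-z))

# A large change touching historically buggy files scores as high risk,
# prompting reviewers to give it extra attention.
print(round(risk_score(120, 4, 1, 3), 2))  # ~0.97
```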

Citations: 0