
Empirical Software Engineering: Latest Publications

Towards Trusted Smart Contracts: A Comprehensive Test Suite For Vulnerability Detection
IF 4.1 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-25 · DOI: 10.1007/s10664-024-10509-w
Andrei Arusoaie, Ștefan-Claudiu Susan

The term smart contract was originally used to describe automated legal contracts. Nowadays, it refers to special programs that run on blockchain platforms and are popular in decentralized applications. In recent years, vulnerabilities in smart contracts caused significant financial losses. Researchers have proposed methods and tools for detecting them and have demonstrated their effectiveness using various test suites. In this paper, we aim to improve the current approach to measuring the effectiveness of vulnerability detectors in smart contracts. First, we identify several traits of existing test suites used to assess tool effectiveness. We explain how these traits limit the evaluation and comparison of vulnerability detection tools. Next, we propose a new test suite that prioritizes diversity over quantity, utilizing a comprehensive taxonomy to achieve this. Our organized test suite enables insightful evaluations and more precise comparisons among vulnerability detection tools. We demonstrate the benefits of our test suite by comparing several vulnerability detection tools using two sets of metrics. Results show that the tools we included in our comparison cover less than half of the vulnerabilities in the new test suite. Finally, based on our results, we answer several questions that we pose in the introduction of the paper about the effectiveness of the compared tools.
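The abstract reports that the compared tools "cover less than half of the vulnerabilities" in the taxonomy-organized suite; a category-level coverage metric over such a suite can be computed in a few lines. A minimal sketch, with hypothetical taxonomy categories, test cases, and detector reports (none taken from the paper):

```python
# Sketch: taxonomy-coverage metric for vulnerability detectors.
# Categories and tool reports are hypothetical placeholders, not the
# paper's actual taxonomy or data.

# Each test case in the suite is labeled with one taxonomy category.
suite = {
    "reentrancy_simple": "reentrancy",
    "reentrancy_cross_fn": "reentrancy",
    "tx_origin_auth": "authorization",
    "unchecked_call": "unchecked_return",
    "timestamp_dep": "block_dependence",
}

# For each tool: the set of test cases it flagged correctly.
detections = {
    "tool_a": {"reentrancy_simple", "unchecked_call"},
    "tool_b": {"reentrancy_simple", "reentrancy_cross_fn"},
}

def category_coverage(detected: set[str], suite: dict[str, str]) -> float:
    """Fraction of taxonomy categories with at least one detected case."""
    all_cats = set(suite.values())
    hit_cats = {suite[case] for case in detected if case in suite}
    return len(hit_cats) / len(all_cats)

for tool, found in detections.items():
    print(tool, f"case coverage={len(found)/len(suite):.2f}",
          f"category coverage={category_coverage(found, suite):.2f}")
```

Reporting coverage per category rather than per case is what makes a diversity-first suite informative: a tool that finds many instances of one vulnerability class still scores low if whole classes go undetected.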

Citations: 0
Automatic title completion for Stack Overflow posts and GitHub issues
IF 4.1 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-25 · DOI: 10.1007/s10664-024-10513-0
Xiang Chen, Wenlong Pei, Shaoyu Yang, Yanlin Zhou, Zichen Zhang, Jiahua Pei

Title quality is important for different software engineering communities. For example, in Stack Overflow, posts with low-quality question titles often discourage potential answerers. In GitHub, issues with low-quality titles can make it difficult for developers to grasp the core idea of the problem. In previous studies, researchers mainly focused on generating titles from scratch by analyzing the body contents, such as the post body for Stack Overflow question title generation (SOTG) and the issue body for issue title generation (ISTG). However, the quality of the generated titles is still limited by the information available in the body contents. A more effective way is to provide accurate completion suggestions when developers compose titles. Inspired by this idea, we are the first to study the problem of automatic title completion for software engineering title generation tasks and propose the approach TC4SETG. Specifically, we first preprocess the gathered titles to form incomplete titles (i.e., tip information provided by developers) for simulating the title completion scene. Then we construct the input by concatenating the incomplete title with the body’s content. Finally, we fine-tune the pre-trained model CodeT5 to learn the title completion patterns effectively. To evaluate the effectiveness of TC4SETG, we selected 189,655 high-quality posts from Stack Overflow by covering eight popular programming languages for the SOTG task and 333,563 issues in the top-200 starred repositories on GitHub for the ISTG task. Our empirical results show that compared with the approaches of generating question titles from scratch, our proposed approach TC4SETG is more practical in automatic and human evaluation. Our experimental results demonstrate that TC4SETG outperforms corresponding state-of-the-art baselines in the SOTG task by a minimum of 25.82% and in the ISTG task by at least 45.48% in terms of ROUGE-L. Therefore, our study provides a new direction for studying automatic software engineering title generation and calls for more researchers to investigate this direction in the future.
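The improvements in the abstract are measured with ROUGE-L, an F-score built on the longest common subsequence (LCS) of the reference and generated titles. For reference, a self-contained sketch of the metric; the β weight is a common convention from the ROUGE literature, not a value stated in the abstract:

```python
# Sketch: ROUGE-L computed from the longest common subsequence (LCS)
# of a reference title and a candidate (e.g., completed) title.

def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(reference: str, candidate: str, beta: float = 1.2) -> float:
    """LCS-based F-score; beta weights recall over precision."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

print(rouge_l("how to sort a list of tuples in python",
              "how to sort a list of tuples"))
```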

Citations: 0
Can we spot energy regressions using developers tests?
IF 4.1 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-25 · DOI: 10.1007/s10664-023-10429-1
Benjamin Danglot, Jean-Rémy Falleri, Romain Rouvoy

Context

Software Energy Consumption (SEC) is gaining more and more attention. In this paper, we tackle the problem of warning developers about increases in the SEC of their programs during Continuous Integration (CI).

Objective

In this study, we investigate if the CI can leverage developers’ tests to perform energy regression testing. Energy regression is similar to performance regression but focuses on the energy consumption of the program instead of standard performance indicators, like execution time or memory consumption.

Method

We perform an exploratory study of the usage of developers’ tests for energy regression testing. We first investigate if developers’ tests can be used to obtain stable SEC indicators. Then, we evaluate if comparing the SEC of developers’ tests between two versions can pinpoint energy regressions introduced by automated program mutations. Finally, we manually evaluate several real commits pinpointed by our approach.
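As a rough illustration of the version-comparison step in this method, one could flag a test whose measured consumption distribution shifts significantly upward between two versions. A sketch with synthetic joule measurements, a standard rank test, and an arbitrary 5% threshold; the paper's actual measurement protocol and statistics may differ:

```python
# Sketch: flagging an energy regression for one developer test by
# comparing repeated SEC measurements (joules) on two versions.
from statistics import median
from scipy.stats import mannwhitneyu

sec_v1 = [12.1, 11.8, 12.3, 12.0, 11.9, 12.2, 12.1, 12.0]  # version N
sec_v2 = [13.4, 13.1, 13.6, 13.2, 13.5, 13.3, 13.4, 13.2]  # version N+1

# One-sided test: is consumption on version N stochastically lower?
stat, p_value = mannwhitneyu(sec_v1, sec_v2, alternative="less")
median_increase = median(sec_v2) / median(sec_v1) - 1

if p_value < 0.05 and median_increase > 0.05:  # >5% median increase
    print(f"energy regression suspected: +{median_increase:.1%} (p={p_value:.4f})")
```

Repeated measurements matter here because SEC readings are noisy; the study's first question (whether developers' tests yield stable SEC indicators) is exactly about whether such distributions are tight enough to compare.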

Results

Our study will pave the way for automated SEC regression tools that can be readily deployed inside an existing CI infrastructure to raise awareness of SEC issues among practitioners.

Citations: 0
On Refining the SZZ Algorithm with Bug Discussion Data
IF 4.1 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-24 · DOI: 10.1007/s10664-024-10511-2
Pooja Rani, Fernando Petrulio, Alberto Bacchelli

Context

Researchers testing hypotheses related to factors leading to low-quality software often rely on historical data, specifically on details regarding when defects were introduced into a codebase of interest. The prevailing techniques to determine the introduction of defects revolve around variants of the SZZ algorithm. This algorithm leverages information on the lines modified during a bug-fixing commit and finds when these lines were last modified, thereby identifying bug-introducing commits.
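A minimal sketch of the core SZZ step just described, assuming a local git checkout: for each line deleted by the bug-fixing commit, `git blame` on the fix's parent reports the commit that last touched that line. The repository path, commit hash, and file below are placeholders:

```python
# Sketch of the SZZ blame step: which commit last modified a line that
# a bug-fixing commit then changed? That commit is a bug-introducing
# candidate. Repo path, commit, and file are hypothetical.
import subprocess

def blame_line(repo: str, fix_commit: str, path: str, line_no: int) -> str:
    """Return the commit that last modified `path:line_no` before the fix."""
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--porcelain",
         "-L", f"{line_no},{line_no}", f"{fix_commit}^", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()[0]  # first token of porcelain output is the commit hash

# Hypothetical usage: line 42 of src/parser.c was changed by the fix.
# print(blame_line("/path/to/repo", "abc1234", "src/parser.c", 42))
```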

Objectives

Despite several improvements and variants, SZZ struggles with accuracy, especially with bug-fixing commits that bundle unrelated modifications or that touch files not involved in the introduction of the bug in the version control system (aka tangled commits and ghost commits).

Methods

Our research investigates whether and how incorporating content retrieved from bug discussions can address these issues by identifying the related and external files and thus improve the efficacy of the SZZ algorithm.

Results

To conduct our investigation, we take advantage of the links manually inserted by Mozilla developers in bug reports to signal which commits inserted bugs. Thus, we prepared the dataset, RoTEB, comprised of 12,472 bug reports. We first manually inspect a sample of 369 bug reports related to these bug-fixing or bug-introducing commits and investigate whether the files mentioned in these reports could be useful for SZZ. After we found evidence that the mentioned files are relevant, we augment SZZ with this information, using different strategies, and evaluate the resulting approach against multiple SZZ variations.
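As a hint of what mining file mentions from bug discussions could look like, here is a heuristic sketch that pulls path-like tokens from a bug-report comment; the regex and the comment text are illustrative, not the study's extraction rule:

```python
# Sketch: harvesting file mentions from a bug-report comment, the kind
# of signal fed back into SZZ. The regex is a simple heuristic.
import re

comment = ("The crash comes from nsDocShell.cpp after the refactoring in "
           "docshell/base/nsDocShell.cpp; see also browser.js line 120.")

FILE_RE = re.compile(r"\b[\w./-]+\.(?:cpp|c|h|js|py|java|rs)\b")
mentioned = sorted(set(FILE_RE.findall(comment)))
print(mentioned)  # ['browser.js', 'docshell/base/nsDocShell.cpp', 'nsDocShell.cpp']
```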

Conclusion

We define a taxonomy outlining the rationale behind developers’ references to diverse files in their discussions. We observe that bug discussions often mention files relevant to enhancing the SZZ algorithm’s efficacy. Then, we verify that integrating these file references augments the precision of SZZ in pinpointing bug-introducing commits. Yet, it does not markedly influence recall. These results deepen our comprehension of the usefulness of bug discussions for SZZ. Future work can leverage our dataset and explore other techniques to further address the problem of tangled commits and ghost commits. Data & material: https://zenodo.org/records/11484723.

Citations: 0
Test-based patch clustering for automatically-generated patches assessment
IF 4.1 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-24 · DOI: 10.1007/s10664-024-10503-2
Matias Martinez, Maria Kechagia, Anjana Perera, Justyna Petke, Federica Sarro, Aldeida Aleti

Previous studies have shown that Automated Program Repair (APR) techniques suffer from the overfitting problem. Overfitting happens when a patch is run and the test suite does not reveal any error, but the patch actually does not fix the underlying bug or it introduces a new defect that is not covered by the test suite. Therefore, the patches generated by APR tools need to be validated by human programmers, which can be very costly, and prevents APR tool adoption in practice. Our work aims to minimize the number of plausible patches that programmers have to review, thereby reducing the time required to find a correct patch. We introduce a novel light-weight test-based patch clustering approach called xTestCluster, which clusters patches based on their dynamic behavior. xTestCluster is applied after the patch generation phase in order to analyze the generated patches from one or more repair tools and to provide more information about those patches for facilitating patch assessment. The novelty of xTestCluster lies in using information from the execution of newly generated test cases to cluster patches generated by multiple APR approaches. A cluster is formed of patches that fail on the same generated test cases. The output from xTestCluster gives developers a) a way of reducing the number of patches to analyze, as they can focus on analyzing a sample of patches from each cluster, and b) additional information (new test cases and their results) attached to each patch. After analyzing 902 plausible patches from 21 Java APR tools, our results show that xTestCluster is able to reduce the number of patches to review and analyze by a median of 50%. xTestCluster can save a significant amount of time for developers that have to review the multitude of patches generated by APR tools, and provides them with new test cases that expose the differences in behavior between generated patches. Moreover, xTestCluster can complement other patch assessment techniques that help detect patch misclassifications.
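The clustering rule itself is simple: patches that share the same set of failing generated tests land in the same cluster. A minimal sketch with made-up patch names and test outcomes:

```python
# Sketch: cluster patches by the signature of generated tests they fail.
# Patch names and test results are illustrative.
from collections import defaultdict

# patch -> set of generated test cases it fails
failures = {
    "patch_01": {"t3", "t7"},
    "patch_02": {"t3", "t7"},
    "patch_03": {"t1"},
    "patch_04": set(),        # passes every generated test
}

clusters: dict[frozenset, list[str]] = defaultdict(list)
for patch, failing in failures.items():
    clusters[frozenset(failing)].append(patch)

for signature, members in clusters.items():
    print(sorted(signature) or "<no failures>", "->", members)
```

A reviewer can then sample one patch per cluster instead of reading all of them, which is where the reported median 50% reduction comes from.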

Citations: 0
Free open source communities sustainability: Does it make a difference in software quality?
IF 4.1 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-23 · DOI: 10.1007/s10664-024-10529-6
Adam Alami, Raúl Pardo, Johan Linåker

Context

Free and Open Source Software (FOSS) communities’ ability to stay viable and productive over time is pivotal for society, as they maintain the building blocks that digital infrastructure, products, and services depend on. Sustainability may, however, be characterized from multiple aspects, and less is known about how these aspects interplay and impact community outputs, and software quality specifically.

Objective

This study, therefore, aims to empirically explore how the different aspects of FOSS sustainability impact software quality.

Method

16 sustainability metrics across four categories were sampled and applied to a set of 217 OSS projects sourced from the Apache Software Foundation Incubator program. The impact of a decline in the sustainability metrics was analyzed against eight software quality metrics using Bayesian data analysis, which incorporates probability distributions to represent the regression coefficients and intercepts.
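A minimal sketch of the kind of Bayesian regression described, using PyMC on synthetic data; the priors, metric names, and dimensions are illustrative, not the study's model:

```python
# Sketch: Bayesian linear regression of one quality metric on a few
# sustainability metrics, with probability distributions over the
# coefficients and intercept. Synthetic data; priors are illustrative.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
X = rng.normal(size=(217, 3))             # 3 standardized sustainability metrics
y = 0.5 * X[:, 0] + rng.normal(size=217)  # e.g., code-duplication percentage

with pm.Model():
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=1, shape=X.shape[1])
    sigma = pm.HalfNormal("sigma", sigma=1)
    pm.Normal("y", mu=intercept + pm.math.dot(X, beta), sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior means of the regression coefficients.
print(idata.posterior["beta"].mean(dim=("chain", "draw")).values)
```

The appeal of the Bayesian setup is that each coefficient comes with a full posterior, so "does a decline in metric X affect quality metric Y" becomes a statement about how much posterior mass lies away from zero.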

Results

Findings suggest that the selected sustainability metrics do not significantly affect defect density or code coverage. However, a positive impact of community age was observed on specific code quality metrics, such as risk complexity, number of very large files, and code duplication percentage. Interestingly, findings show that even when communities are sustainable, certain code quality metrics are negatively impacted.

Conclusion

Findings imply that code quality practices are not consistently linked to sustainability, and defect management and prevention may be prioritized over the former. Results suggest that growth, resulting in a more complex and large codebase, combined with a probable lack of understanding of code quality standards, may explain the degradation in certain aspects of code quality.

Citations: 0
Explaining poor performance of text-based machine learning models for vulnerability detection
IF 4.1 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-22 · DOI: 10.1007/s10664-024-10519-8
Kollin Napier, Tanmay Bhowmik, Zhiqian Chen

With the increasing severity of software vulnerabilities, machine learning models are being adopted to combat this threat. Given the range of possible uses for such models, research in this area has introduced various approaches. Although models may differ in performance, there is an overall lack of explainability in understanding how a model learns and predicts. Furthermore, recent research suggests that models perform poorly when detecting vulnerabilities by interpreting source code as text, an approach known as “text-based” models. To help explain this poor performance, we explore the dimensions of explainability. Following recent studies on text-based models, we experiment with the removal of overlapping features present in both training and testing datasets, deemed “cross-cutting”. We conduct scenario experiments removing such “cross-cutting” data and reassessing model performance. Based on the results, we examine how removal of these “cross-cutting” features may affect model performance. Our results show that removal of “cross-cutting” features may improve model performance in general, thus pointing to explainable dimensions regarding data dependency and agnostic models. Overall, we conclude that model performance can be improved, and explainable aspects of such models can be identified via empirical analysis of the models’ performance.
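A sketch of the “cross-cutting” removal idea on a bag-of-tokens representation: compute the token overlap between the training and testing splits and exclude it before re-vectorizing. The code snippets and tokenization are illustrative, not the study's setup:

```python
# Sketch: removing "cross-cutting" token features, i.e. those occurring
# in both the training and testing splits of a text-based model.
from sklearn.feature_extraction.text import CountVectorizer

train_code = ["strcpy ( buf , input )", "memcpy ( dst , src , n )"]
test_code = ["strcpy ( dest , user_data )", "free ( ptr ) ; free ( ptr )"]

def vocab(docs):
    """Token vocabulary of a document set (whitespace tokens)."""
    return set(CountVectorizer(token_pattern=r"\S+").fit(docs).vocabulary_)

cross_cutting = vocab(train_code) & vocab(test_code)

# Rebuild the representation with the overlapping features excluded.
vectorizer = CountVectorizer(token_pattern=r"\S+", stop_words=list(cross_cutting))
X_train = vectorizer.fit_transform(train_code)
print("removed:", sorted(cross_cutting))
print("remaining features:", vectorizer.get_feature_names_out())
```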

Citations: 0
Towards Exploring the Limitations of Test Selection Techniques on Graph Neural Networks: An Empirical Study
IF 4.1 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-22 · DOI: 10.1007/s10664-024-10515-y
Xueqi Dang, Yinghua Li, Wei Ma, Yuejun Guo, Qiang Hu, Mike Papadakis, Maxime Cordy, Yves Le Traon

Graph Neural Networks (GNNs) have gained prominence in various domains, such as social network analysis, recommendation systems, and drug discovery, due to their ability to model complex relationships in graph-structured data. GNNs can exhibit incorrect behavior, resulting in severe consequences. Therefore, testing is necessary and pivotal. However, labeling all test inputs for GNNs can be prohibitively costly and time-consuming, especially when dealing with large and complex graphs. In response to these challenges, test selection has emerged as a strategic approach to alleviate labeling expenses. The objective of test selection is to select a subset of tests from the complete test set. While various test selection techniques have been proposed for traditional deep neural networks (DNNs), their adaptation to GNNs presents unique challenges due to the distinctions between DNN and GNN test data. Specifically, DNN test inputs are independent of each other, whereas GNN test inputs (nodes) exhibit intricate interdependencies. Therefore, it remains unclear whether DNN test selection approaches can perform effectively on GNNs. To fill the gap, we conduct an empirical study that systematically evaluates the effectiveness of various test selection methods in the context of GNNs, focusing on three critical aspects: 1) Misclassification detection: selecting test inputs that are more likely to be misclassified; 2) Accuracy estimation: selecting a small set of tests to precisely estimate the accuracy of the whole testing set; 3) Performance enhancement: selecting retraining inputs to improve the GNN accuracy. Our empirical study encompasses 7 graph datasets and 8 GNN models, evaluating 22 test selection approaches. Our study includes not only node classification datasets but also graph classification datasets. Our findings reveal that: 1) In GNN misclassification detection, confidence-based test selection methods, which perform well in DNNs, do not demonstrate the same level of effectiveness; 2) In terms of GNN accuracy estimation, clustering-based methods, while consistently performing better than random selection, provide only slight improvements; 3) Regarding selecting inputs for GNN performance improvement, test selection methods, such as confidence-based and clustering-based test selection methods, demonstrate only slight effectiveness; 4) Concerning performance enhancement, node importance-based test selection methods are not suitable, and in many cases, they even perform worse than random selection.
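For concreteness, the confidence-based baseline evaluated here can be sketched in a few lines: rank test inputs (nodes) by the model's top softmax probability and label the least confident first. The probabilities below are made up:

```python
# Sketch: confidence-based test selection for misclassification detection.
# Rank nodes by top softmax probability; lowest-confidence nodes first.
import numpy as np

softmax_probs = np.array([  # one row per test node, columns are classes
    [0.97, 0.02, 0.01],
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.90, 0.05, 0.05],
])

confidence = softmax_probs.max(axis=1)
budget = 2
selected = np.argsort(confidence)[:budget]  # most likely misclassified first
print("label these nodes first:", selected.tolist())  # [1, 2]
```

Treating each row independently is exactly the assumption the study questions for GNNs, since a node's prediction also depends on its neighbors in the graph.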

Citations: 0
Prioritizing test cases for deep learning-based video classifiers
IF 4.1 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-22 · DOI: 10.1007/s10664-024-10520-1
Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, Tegawendé F. Bissyandé

The widespread adoption of video-based applications across various fields highlights their importance in modern software systems. However, in comparison to images or text, labelling video test cases for the purpose of assessing system accuracy can lead to increased expenses due to their temporal structure and larger volume. Test prioritization has emerged as a promising approach to mitigate the labeling cost, which prioritizes potentially misclassified test inputs so that such inputs can be identified earlier with limited time and manual labeling efforts. However, applying existing prioritization techniques to video test cases faces certain limitations: they do not account for the unique temporal information present in video data. Unlike static image datasets that only contain spatial information, video inputs consist of multiple frames that capture the dynamic changes of objects over time. In this paper, we propose VRank, the first test prioritization approach designed specifically for video test inputs. The fundamental idea behind VRank is that video-type tests with a higher probability of being misclassified by the evaluated DNN classifier are considered more likely to reveal faults and will be prioritized higher. To this end, we train a ranking model with the aim of predicting the probability of a given test input being misclassified by a DNN classifier. This prediction relies on four types of generated features: temporal features (TF), video embedding features (EF), prediction features (PF), and uncertainty features (UF). We rank all test inputs in the target test set based on their misclassification probabilities. Videos with a higher likelihood of being misclassified will be prioritized higher. We conducted an empirical evaluation to assess the performance of VRank, involving 120 subjects with both natural and noisy datasets. The experimental results reveal VRank outperforms all compared test prioritization methods, with an average improvement of 5.76%–46.51% on natural datasets and 4.26%–53.56% on noisy datasets.
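A minimal sketch of the ranking step: concatenate the four feature groups, fit a classifier to predict misclassification, and sort unseen test videos by the predicted probability. Features and labels are synthetic stand-ins, and the random forest is an illustrative choice rather than the paper's model:

```python
# Sketch: a VRank-style ranking model over concatenated feature groups.
# TF/EF/PF/UF contents here are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 200
tf, ef, pf, uf = (rng.normal(size=(n, d)) for d in (8, 16, 4, 4))
features = np.hstack([tf, ef, pf, uf])
misclassified = (rng.random(n) < 0.2).astype(int)  # 1 = DNN got it wrong

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(features[:150], misclassified[:150])       # labeled historical tests
p_wrong = model.predict_proba(features[150:])[:, 1]  # new test videos
ranking = np.argsort(-p_wrong)                       # inspect these first
print("top-5 videos to label:", ranking[:5].tolist())
```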

Citations: 0
Does using Bazel help speed up continuous integration builds?
IF 4.1 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · Pub Date: 2024-07-19 · DOI: 10.1007/s10664-024-10497-x
Shenyu Zheng, Bram Adams, Ahmed E. Hassan

A long continuous integration (CI) build forces developers to wait for CI feedback before starting subsequent development activities, leading to wasted time. In addition to the variety of build scheduling and test selection heuristics studied in the past, new artifact-based build technologies like Bazel have built-in support for advanced performance optimizations such as parallel builds and incremental builds (caching of build results). However, little is known about the extent to which new build technologies like Bazel deliver on their promised benefits, especially for long-build-duration projects. In this study, we collected 383 Bazel projects from GitHub, studied their usage of Bazel's parallel and incremental builds in popular CI services (GitHub Actions, CircleCI, Travis CI, or Buildkite), and compared the results with Maven projects. We conducted 3,500 experiments on 383 Bazel projects and analyzed the build logs of a subset of 70 buildable projects to evaluate the performance impact of Bazel's parallel builds. Additionally, we performed 102,232 experiments on the 70 buildable projects' last 100 commits to evaluate Bazel's incremental build performance. Our results show that 31.23% of Bazel projects adopt a CI service but do not use Bazel in the CI service, while among those who do use Bazel in CI, 27.76% use other tools to facilitate Bazel's execution. Compared to sequential builds, the median speedups for long-build-duration projects are 2.00x, 3.84x, 7.36x, and 12.80x at parallelism degrees 2, 4, 8, and 16, respectively, while, compared to a clean build, applying incremental builds achieves a median speedup of 4.22x (with a build-system-tool-independent CI cache) and 4.71x (with a build-system-tool-specific cache) for long-build-duration projects. Our results provide guidance for developers to improve the usage of Bazel in their projects, and emphasize the importance of exploring modern build systems, given the current lack of literature and their potential advantages within contemporary software practices such as cloud computing and microservices.
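A sketch of a clean-versus-cached build measurement in the spirit of these experiments, assuming a checked-out Bazel workspace; `--jobs` and `--disk_cache` are real Bazel flags, while the target pattern and cache path are placeholders:

```python
# Sketch: time a cold build against a build replayed from Bazel's disk
# cache, which survives `bazel clean --expunge`.
import subprocess
import time

def timed(cmd: list[str]) -> float:
    """Run a command and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start

bazel = ["bazel", "build", "//...", "--jobs=8", "--disk_cache=/tmp/bazel-cache"]

subprocess.run(["bazel", "clean", "--expunge"], check=True)
cold = timed(bazel)   # everything compiled; disk cache gets filled
subprocess.run(["bazel", "clean", "--expunge"], check=True)
warm = timed(bazel)   # action outputs replayed from the disk cache
print(f"cache speedup: {cold / warm:.2f}x")
```

Varying `--jobs` across runs of the cold build would reproduce the parallelism dimension of the study in the same fashion.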

Citations: 0