
Latest Publications: IEEE Transactions on Software Engineering

GenMorph: Automatically Generating Metamorphic Relations via Genetic Programming
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) | Pub Date: 2024-03-31 | DOI: 10.1109/TSE.2024.3407840
Jon Ayerdi;Valerio Terragni;Gunel Jahangirova;Aitor Arrieta;Paolo Tonella
Metamorphic testing is a popular approach that aims to alleviate the oracle problem in software testing. At the core of this approach are Metamorphic Relations (MRs), specifying properties that hold among multiple test inputs and corresponding outputs. Deriving MRs is mostly a manual activity, since their automated generation is a challenging and largely unexplored problem. This paper presents GenMorph, a technique to automatically generate MRs for Java methods that involve inputs and outputs that are boolean, numerical, or ordered sequences. GenMorph uses an evolutionary algorithm to search for effective test oracles, i.e., oracles that trigger no false alarms and expose software faults in the method under test. The proposed search algorithm is guided by two fitness functions that measure the number of false alarms and the number of missed faults for the generated MRs. Our results show that GenMorph generates effective MRs for 18 out of 23 methods (mutation score > 20%). Furthermore, it can increase Randoop's fault detection capability in 7 out of 23 methods, and Evosuite's in 14 out of 23 methods. When compared with AutoMR, a state-of-the-art MR generator, GenMorph also outperformed its fault detection capability in 9 out of 10 methods.
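To make the two-objective search concrete, here is a minimal Python sketch of evolving metamorphic relations for a numeric function: a candidate MR pairs an input transformation with an output relation, and its fitness counts false alarms on the original program and missed faults on mutants. This is only an illustration under simplifying assumptions (random reinitialization instead of real crossover and mutation; the names TRANSFORMS, RELATIONS, and evolve are hypothetical), not GenMorph's actual implementation.

```python
import random

# Candidate MR for a numeric function f: assert f(transform(x)) REL f(x).
TRANSFORMS = [lambda x, k=k: x + k for k in (1, 2, 10)] + [lambda x: -x, lambda x: 2 * x]
RELATIONS = {"==": lambda a, b: a == b, ">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}

def make_candidate():
    return (random.choice(TRANSFORMS), random.choice(list(RELATIONS)))

def fitness(candidate, program, mutants, inputs):
    """Two objectives, mirroring the abstract: false alarms on the original
    program (lower is better) and missed faults on mutants (lower is better)."""
    transform, rel_name = candidate
    rel = RELATIONS[rel_name]
    false_alarms = sum(not rel(program(transform(x)), program(x)) for x in inputs)
    missed = sum(all(rel(m(transform(x)), m(x)) for x in inputs) for m in mutants)
    return false_alarms, missed

def evolve(program, mutants, inputs, generations=50, pop_size=20):
    """Simplified evolutionary loop: keep the better half, refill at random."""
    pop = [make_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, program, mutants, inputs))
        survivors = pop[: pop_size // 2]
        pop = survivors + [make_candidate() for _ in range(pop_size - len(survivors))]
    return min(pop, key=lambda c: fitness(c, program, mutants, inputs))

# Toy usage: abs() satisfies f(-x) == f(x); the identity "mutant" violates it.
best_transform, best_rel = evolve(abs, mutants=[lambda x: x], inputs=range(-5, 6))
print("best relation:", best_rel)
```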
Citations: 0
Reducing the Length of Field-Replay Based Load Testing
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) | Pub Date: 2024-03-31 | DOI: 10.1109/TSE.2024.3408079
Yuanjie Xia;Lizhi Liao;Jinfu Chen;Heng Li;Weiyi Shang
As software systems continuously grow in size and complexity, performance- and load-related issues have become more common than functional issues. Load testing is usually performed before software releases to ensure that the software system can still provide quality service under a certain load. Therefore, one of the common challenges of load testing is to design realistic workloads that can represent the actual workload in the field. In particular, one of the most widely adopted and intuitive approaches is to directly replay the field workloads in the load testing environment. However, replaying lengthy field workloads (e.g., 48 hours) is rather resource- and time-consuming, and sometimes even infeasible for large-scale software systems that adopt a rapid release cycle. On the other hand, replaying only a short duration of the field workloads may still result in unrealistic load testing. In this work, we propose an automated approach to reduce the length of load testing that is driven by replaying the field workloads. The intuition behind our approach is that if the measured performance associated with a particular system behaviour is already stable, we can skip subsequent testing of this system behaviour to reduce the length of the field workloads. In particular, our approach first clusters execution logs that are generated during the system runtime to identify similar system behaviours during the field workloads. Then, we use statistical methods to determine whether the measured performance associated with a system behaviour has become stable. We evaluate our approach on three open-source projects (i.e., OpenMRS, TeaStore, and Apache James). The results show that our approach can significantly reduce the length of field workloads while the reduced workloads it produces remain representative of the original set of workloads. More importantly, the load testing results obtained by replaying the reduced workloads show high correlation and similar trends with those of the original set of workloads. Practitioners can leverage our approach to perform realistic field-replay based load testing while saving the needed resources and time. Our approach sheds light on future research that aims to reduce the cost of load testing for large-scale software systems.
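As a rough illustration of the skip-when-stable idea, the sketch below replays field requests but stops measuring a behaviour cluster once its response times look stable; the window-based mean-comparison test and the classify/measure callbacks are assumptions made for the example, not the clustering and statistical methods used in the paper.

```python
from collections import defaultdict
from statistics import mean

def is_stable(samples, window=30, tolerance=0.03):
    """Hypothetical stability check: the mean of the last window of
    measurements differs from the mean of the preceding window by less
    than `tolerance` (relative)."""
    if len(samples) < 2 * window:
        return False
    recent, previous = samples[-window:], samples[-2 * window:-window]
    return abs(mean(recent) - mean(previous)) <= tolerance * mean(previous)

def reduced_replay(workload, classify, measure):
    """Replay field requests, skipping requests whose behaviour cluster
    already shows stable measured performance."""
    samples = defaultdict(list)   # behaviour cluster -> response-time samples
    executed = []
    for request in workload:
        cluster = classify(request)        # e.g. from log-template clustering
        if is_stable(samples[cluster]):
            continue                       # performance already stable: skip
        samples[cluster].append(measure(request))
        executed.append(request)
    return executed
```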
Citations: 0
DAppSCAN: Building Large-Scale Datasets for Smart Contract Weaknesses in DApp Projects
IF 7.4 | CAS Tier 1 (Computer Science) | Q1 (Computer Science) | Pub Date: 2024-03-29 | DOI: 10.1109/TSE.2024.3383422
Zibin Zheng;Jianzhong Su;Jiachi Chen;David Lo;Zhijie Zhong;Mingxi Ye
The Smart Contract Weakness Classification Registry (SWC Registry) is a widely recognized list of smart contract weaknesses specific to the Ethereum platform. Despite the SWC Registry not being updated with new entries since 2020, the sustained development of smart contract analysis tools for detecting SWC-listed weaknesses highlights their ongoing significance in the field. However, evaluating these tools has proven challenging due to the absence of a large, unbiased, real-world dataset. To address this problem, we aim to build a large-scale SWC weakness dataset from real-world DApp projects. We recruited 22 participants and spent 44 person-months analyzing 1,199 open-source audit reports from 29 security teams. In total, we identified 9,154 weaknesses and developed two distinct datasets, i.e., DAppSCAN-Source and DAppSCAN-Bytecode. The DAppSCAN-Source dataset comprises 39,904 Solidity files, featuring 1,618 SWC weaknesses sourced from 682 real-world DApp projects. However, the Solidity files in this dataset may not be directly compilable for further analysis. To facilitate automated analysis, we developed a tool capable of automatically identifying dependency relationships within DApp projects and completing missing public libraries. Using this tool, we created the DAppSCAN-Bytecode dataset, which consists of 6,665 compiled smart contracts with 888 SWC weaknesses. Based on DAppSCAN-Bytecode, we conducted an empirical study to evaluate the performance of state-of-the-art smart contract weakness detection tools. The evaluation results revealed sub-par performance for these tools in terms of both effectiveness and success detection rate, indicating that future development should prioritize real-world datasets over simplistic toy contracts.
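The dependency-completion step can be pictured with a small sketch like the one below, which scans a DApp project for Solidity import statements and reports imports that do not resolve inside the project (typically public libraries, such as OpenZeppelin packages, that must be fetched before the contracts can compile). The regular expression and resolution rules are simplified assumptions for illustration, not the authors' tool.

```python
import re
from pathlib import Path

# Simplified pattern: handles `import "path";` and `import {X} from "path";`.
IMPORT_RE = re.compile(r'^\s*import\s+(?:\{[^}]*\}\s+from\s+)?["\']([^"\']+)["\']', re.M)

def solidity_imports(project_root):
    """Map each .sol file in the project to the import paths it declares."""
    deps = {}
    for sol in Path(project_root).rglob("*.sol"):
        deps[sol] = IMPORT_RE.findall(sol.read_text(errors="ignore"))
    return deps

def unresolved_imports(deps, project_root):
    """Imports that do not resolve to a file inside the project: typically
    public libraries (e.g. @openzeppelin/...) that would need to be fetched
    before the contracts can be compiled."""
    root = Path(project_root)
    missing = set()
    for sol, imports in deps.items():
        for imp in imports:
            target = (sol.parent / imp) if imp.startswith(".") else (root / imp)
            if not target.resolve().exists():
                missing.add(imp)
    return missing
```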
Citations: 0
Evaluating SZZ Implementations: An Empirical Study on the Linux Kernel
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) | Pub Date: 2024-03-29 | DOI: 10.1109/TSE.2024.3406718
Yunbo Lyu;Hong Jin Kang;Ratnadira Widyasari;Julia Lawall;David Lo
The SZZ algorithm is used to connect bug-fixing commits to the earlier commits that introduced bugs. This algorithm has many applications and many variants have been devised. However, there are some types of commits that cannot be traced by the SZZ algorithm, referred to as “ghost commits”. The evaluation of how these ghost commits impact the SZZ implementations remains limited. Moreover, these implementations have been evaluated on datasets created by software engineering researchers from information in bug trackers and version controlled histories. Since October 2013, the Linux kernel developers have, as a standard practice, labelled bug-fixing patches with the commit identifiers of the corresponding bug-inducing commit(s). As of v6.1-rc5, 76,046 pairs of bug-fixing patches and bug-inducing commits are available. This provides a unique opportunity to evaluate the SZZ algorithm on a large dataset that has been created and reviewed by project developers, entirely independently of the biases of software engineering researchers. In this paper, we apply six SZZ implementations to 76,046 pairs of bug-fixing patches and bug-introducing commits from the Linux kernel. Our findings reveal that SZZ algorithms experience a more significant decline in recall on our dataset (a drop of 13.8%) as compared to prior findings reported by Rosa et al., and the disparities between the individual SZZ algorithms diminish. Moreover, we find that 17.47% of bug-fixing commits are ghost commits. Finally, we propose Tracing-Commit SZZ (TC-SZZ), which traces all commits in the change history of lines modified or deleted in bug-fixing commits. Applying TC-SZZ to all failure cases, excluding ghost commits, we found that TC-SZZ could identify 17.7% of them. Our further analysis based on git log found that 34.6% of bug-inducing commits were in the function history, 27.5% in the file history (but not in the function history), and 37.9% not in the file history. We further evaluated the effectiveness of ChatGPT in boosting the SZZ algorithm's ability to identify bug-inducing commits in the function history, in the file history, and not in the file history.
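For readers unfamiliar with the algorithm, the sketch below shows the core tracing step in Python on top of plain git commands: classic SZZ blames each line deleted by a bug-fixing commit in that commit's parent, while the TC-SZZ variant described above collects every commit in the line's change history. The helper names and exact git invocations are illustrative assumptions, not the paper's implementation.

```python
import re
import subprocess

def git(repo, *args):
    """Run a git command in `repo` and return its stdout."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def szz_candidates(repo, fix_commit, path, deleted_lines):
    """Classic SZZ step: blame each line deleted by the fix in the fix's
    parent; the blamed commits are the bug-inducing candidates."""
    inducing = set()
    for n in deleted_lines:
        out = git(repo, "blame", "--porcelain", "-L", f"{n},{n}",
                  f"{fix_commit}^", "--", path)
        inducing.add(out.split()[0])       # first token is the blamed SHA
    return inducing

def tc_szz_candidates(repo, fix_commit, path, line_no):
    """TC-SZZ-style step: collect *all* commits in the change history of
    the modified line, not only the most recent one."""
    out = git(repo, "log", "--format=%H",
              "-L", f"{line_no},{line_no}:{path}", f"{fix_commit}^")
    return [l for l in out.splitlines() if re.fullmatch(r"[0-9a-f]{40}", l)]
```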
Citations: 0
ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation
IF 7.4 | CAS Tier 1 (Computer Science) | Q1 (Computer Science) | Pub Date: 2024-03-29 | DOI: 10.1109/TSE.2024.3382365
Yutian Tang;Zhijie Liu;Zhichao Zhou;Xiapu Luo
Recent advancements in large language models (LLMs) have demonstrated exceptional success in a wide range of general domain tasks, such as question answering and following instructions. Moreover, LLMs have shown potential in various software engineering applications. In this study, we present a systematic comparison of test suites generated by the ChatGPT LLM and the state-of-the-art SBST tool EvoSuite. Our comparison is based on several critical factors, including correctness, readability, code coverage, and bug detection capability. By highlighting the strengths and weaknesses of LLMs (specifically ChatGPT) in generating unit test cases compared to EvoSuite, this work provides valuable insights into the performance of LLMs in solving software engineering problems. Overall, our findings underscore the potential of LLMs in software engineering and pave the way for further research in this area.
Citations: 0
Automated Code Editing With Search-Generate-Modify
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) | Pub Date: 2024-03-27 | DOI: 10.1109/TSE.2024.3376387
Changshu Liu;Pelin Cetin;Yogesh Patodia;Baishakhi Ray;Saikat Chakraborty;Yangruibo Ding
Code editing is essential in evolving software development. In literature, several automated code editing tools are proposed, which leverage Information Retrieval-based techniques and Machine Learning-based code generation and code editing models. Each technique comes with its own promises and perils, and for this reason, they are often used together to complement their strengths and compensate for their weaknesses. This paper proposes a hybrid approach to better synthesize code edits by leveraging the power of code search, generation, and modification. Our key observation is that a patch that is obtained by search & retrieval, even if incorrect, can provide helpful guidance to a code generation model. However, a retrieval-guided patch produced by a code generation model can still be a few tokens off from the intended patch. Such generated patches can be slightly modified to create the intended patches. We developed a novel tool to solve this challenge: SarGaM, which is designed to follow a real developer's code editing behavior. Given an original code version, the developer may search for the related patches, generate or write the code, and then modify the generated code to adapt it to the right context. Our evaluation of SarGaM on edit generation shows superior performance w.r.t. the current state-of-the-art techniques. SarGaM also shows its effectiveness on automated program repair tasks.
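Read as a pipeline, the abstract's search-generate-modify flow could be sketched roughly as follows; the callback names, beam size, and test-based acceptance check are assumptions made for illustration and do not reflect SarGaM's actual components.

```python
def search_generate_modify(buggy_code, tests, retriever, generator, modifier, beam=5):
    """Illustrative three-stage repair loop following the abstract:
    1. search a patch corpus for similar patches (possibly imperfect guidance),
    2. generate retrieval-guided candidate patches,
    3. apply small token-level modifications and accept the first candidate
       that passes the tests."""
    retrieved = retriever(buggy_code)
    for candidate in generator(buggy_code, retrieved, n=beam):
        for patched in modifier(candidate):
            if tests(patched):
                return patched
    return None  # no plausible patch found
```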
Citations: 0
Cross-Language Taint Analysis: Generating Caller-Sensitive Native Code Specification for Java
IF 7.4 | CAS Tier 1 (Computer Science) | Q1 (Computer Science) | Pub Date: 2024-03-27 | DOI: 10.1109/TSE.2024.3392254
Shuangxiang Kan;Yuhao Gao;Zexin Zhong;Yulei Sui
Cross-language programming is a common practice within the software development industry, offering developers a multitude of advantages such as expressiveness, interoperability, and cross-platform compatibility, for developing large-scale applications. As an important example, JNI (Java Native Interface) programming is widely used in diverse scenarios where Java interacts with code written in other programming languages, such as C or C++. Conventional static analysis based on a single programming language faces challenges when it comes to tracing the flow of values across multiple modules that are coded in different programming languages. In this paper, we introduce CSS, a new Caller-Sensitive Specification approach designed to enhance the static taint analysis of Java programs employing JNI to interface with C/C++ code. In contrast to conservative specifications, this approach takes into consideration the calling context of the invoked C/C++ functions (or cross-language context), resulting in more precise and concise specifications for the side effects of native code. Furthermore, CSS specifically enhances the capabilities of Java analyzers, enabling them to perform precise static taint analysis across language boundaries into native code. The experimental results show that CSS can accurately summarize value-flow information and enhance the ability of Java monolingual static analyzers for cross-language taint flow tracking.
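The caller-sensitive idea can be illustrated with a toy taint model: instead of one fixed, conservative summary per native function, the summary is computed from the taint state at the call site. Everything below (the copy summary, the argument-index encoding) is a made-up example to convey the concept, not the CSS specification format.

```python
def spec_copy(tainted_args):
    """Hypothetical summary for a JNI-backed copy(dst, src): the return
    value and dst (arg0) become tainted only when src (arg1) is tainted
    at this particular call site."""
    return {"return", "arg0"} if 1 in tainted_args else set()

# Caller-sensitive specifications: native function name -> summary function.
NATIVE_SPECS = {"copy": spec_copy}

def propagate_native_call(fn_name, tainted_args):
    """tainted_args: indices of arguments carrying taint at the call site.
    The conservative fallback taints the return value whenever any argument is."""
    spec = NATIVE_SPECS.get(fn_name, lambda t: {"return"} if t else set())
    return spec(tainted_args)

print(propagate_native_call("copy", {1}))     # src tainted: dst and return become tainted
print(propagate_native_call("copy", set()))   # nothing tainted: empty summary
```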
Citations: 0
MASTER: Multi-Source Transfer Weighted Ensemble Learning for Multiple Sources Cross-Project Defect Prediction
IF 7.4 | CAS Tier 1 (Computer Science) | Q1 (Computer Science) | Pub Date: 2024-03-25 | DOI: 10.1109/TSE.2024.3381235
Haonan Tong;Dalin Zhang;Jiqiang Liu;Weiwei Xing;Lingyun Lu;Wei Lu;Yumei Wu
Multi-source cross-project defect prediction (MSCPDP) attempts to transfer defect knowledge learned from multiple source projects to the target project. MSCPDP has drawn increasing attention from academic and industry communities owing to its advantages compared with single-source cross-project defect prediction (SSCPDP). However, two main problems, which are how to effectively extract the transferable knowledge from each source dataset and how to measure the amount of knowledge transferred from each source dataset to the target dataset, seriously restrict the performance of existing MSCPDP models. In this paper, we propose a novel multi-source transfer weighted ensemble learning (MASTER) method for MSCPDP. MASTER measures the weight of each source dataset based on feature importance and distribution difference and then extracts the transferable knowledge based on the proposed feature-weighted transfer learning algorithm. Experiments are performed on 30 software projects. We compare MASTER with the latest state-of-the-art MSCPDP methods with statistical test in terms of famous effort-unaware measures (i.e., PD, PF, AUC, and MCC) and two widely used effort-aware measures ($P_{opt}20%$ and IFA). The experiment results show that: 1) MASTER can substantially improve the prediction performance compared with the baselines, e.g., an improvement of at least 49.1% in MCC, 48.1% in IFA; 2) MASTER significantly outperforms each baseline on most datasets in terms of AUC, MCC, $P_{opt}20%$ and IFA; 3) MSCPDP model significantly performs better than the mean case of SSCPDP model on most datasets and even outperforms the best case of SSCPDP on some datasets. It can be concluded that 1) it is very necessary to conduct MSCPDP, and 2) the proposed MASTER is a more promising alternative for MSCPDP.
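A very simplified sketch of the weighting idea: each source project gets a weight that is inversely related to its (optionally feature-importance-weighted) distribution distance from the target project, and per-source predictions are then combined with those weights. The distance measure and combination rule below are placeholders chosen for illustration, not the algorithm defined in the paper.

```python
import numpy as np

def distribution_distance(source_X, target_X):
    """Per-feature gap between source and target (mean and std difference),
    averaged over features; a stand-in for the paper's distribution-difference measure."""
    mu = np.abs(source_X.mean(0) - target_X.mean(0))
    sd = np.abs(source_X.std(0) - target_X.std(0))
    return float((mu + sd).mean())

def source_weights(sources, target_X, importance=None):
    """Weight each source dataset inversely to its distance from the target;
    optionally weight features by an importance vector. Weights sum to 1."""
    dists = []
    for X, _y in sources:
        gap = np.abs(X.mean(0) - target_X.mean(0)) + np.abs(X.std(0) - target_X.std(0))
        if importance is not None:
            gap = gap * (importance / importance.sum())
            dists.append(float(gap.sum()))
        else:
            dists.append(float(gap.mean()))
    inv = 1.0 / (np.array(dists) + 1e-9)
    return inv / inv.sum()

def weighted_ensemble_predict(models, weights, target_X):
    """Combine per-source defect probabilities with the learned weights."""
    probs = np.stack([m(target_X) for m in models])   # each m returns P(defective)
    return (weights[:, None] * probs).sum(axis=0) > 0.5
```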
Citations: 0
Understanding and Detecting Real-World Safety Issues in Rust
IF 7.4 | CAS Tier 1 (Computer Science) | Q1 (Computer Science) | Pub Date: 2024-03-25 | DOI: 10.1109/TSE.2024.3380393
Boqin Qin;Yilun Chen;Haopeng Liu;Hua Zhang;Qiaoyan Wen;Linhai Song;Yiying Zhang
Rust is a relatively new programming language designed for systems software development. Its objective is to combine the safety guarantees typically associated with high-level languages with the performance efficiency often found in executable programs implemented in low-level languages. The core design of Rust is a set of strict safety rules enforced through compile-time checks. However, to support more low-level controls, Rust also allows programmers to bypass its compiler checks by writing unsafe code. As the adoption of Rust grows in the development of safety-critical software, it becomes increasingly important to understand what safety issues may elude Rust's compiler checks and manifest in real Rust programs. In this paper, we conduct a comprehensive, empirical study of Rust safety issues by close, manual inspection of 70 memory bugs, 100 concurrency bugs, and 110 programming errors leading to unexpected execution panics from five open-source Rust projects, five widely-used Rust libraries, and two online security databases. Our study answers three important questions: what memory-safety issues real Rust programs have, what concurrency bugs Rust programmers make, and how unexpected panics in Rust programs are caused. Our study reveals interesting real-world Rust program behaviors and highlights new issues made by Rust programmers. Building upon the findings of our study, we design and implement five static detectors. After being applied to the studied Rust programs and another 12 selected Rust projects, our checkers pinpoint 96 previously unknown bugs and report a negligible number of false positives, confirming their effectiveness and the value of our empirical study.
Citations: 0
Evaluating Search-Based Software Microbenchmark Prioritization
IF 6.5 | CAS Tier 1 (Computer Science) | Q1 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) | Pub Date: 2024-03-22 | DOI: 10.1109/TSE.2024.3380836
Christoph Laaber;Tao Yue;Shaukat Ali
Ensuring that software performance does not degrade after a code change is paramount. A solution is to regularly execute software microbenchmarks, a performance testing technique similar to (functional) unit tests, which, however, often becomes infeasible due to extensive runtimes. To address that challenge, research has investigated regression testing techniques, such as test case prioritization (TCP), which reorder the execution within a microbenchmark suite to detect larger performance changes sooner. Such techniques are either designed for unit tests and perform sub-par on microbenchmarks or require complex performance models, drastically reducing their potential application. In this paper, we empirically evaluate single- and multi-objective search-based microbenchmark prioritization techniques to understand whether they are more effective and efficient than greedy, coverage-based techniques. For this, we devise three search objectives, i.e., coverage to maximize, coverage overlap to minimize, and historical performance change detection to maximize. We find that search algorithms (SAs) are only competitive with but do not outperform the best greedy, coverage-based baselines. However, a simple greedy technique utilizing solely the performance change history (without coverage information) is equally or more effective than the best coverage-based techniques while being considerably more efficient, with a runtime overhead of less than 1%. These results show that simple, non-coverage-based techniques are a better fit for microbenchmarks than complex coverage-based techniques.
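As an illustration of the history-only greedy idea mentioned in the abstract, the sketch below simply runs first the microbenchmarks whose past versions showed the largest performance changes; the scoring rule and data layout are assumptions for the example, not the exact technique evaluated in the paper.

```python
def prioritize_by_change_history(history, current_order):
    """Greedy, history-only prioritization: order the suite so that
    microbenchmarks with the largest past relative performance changes
    run first.

    history: dict benchmark -> list of past relative performance changes,
             e.g. {"parseLargeFile": [0.02, 0.31], ...} (hypothetical data)."""
    def score(bench):
        changes = history.get(bench, [])
        return max((abs(c) for c in changes), default=0.0)
    return sorted(current_order, key=score, reverse=True)

suite = ["parseLargeFile", "hashSmallKey", "serializeTree"]
past = {"parseLargeFile": [0.02, 0.31], "hashSmallKey": [0.0], "serializeTree": [0.12, 0.4]}
print(prioritize_by_change_history(past, suite))
# ['serializeTree', 'parseLargeFile', 'hashSmallKey']
```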
Citations: 0