
Latest Publications in IEEE Transactions on Software Engineering

Multitask-Based Evaluation of Open-Source LLM on Software Vulnerability
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-10-07 DOI: 10.1109/TSE.2024.3470333
Xin Yin;Chao Ni;Shaohua Wang
This paper proposes a pipeline for quantitatively evaluating interactive Large Language Models (LLMs) using publicly available datasets. We carry out an extensive technical evaluation of LLMs using Big-Vul, covering four common software vulnerability tasks, and assess the multi-tasking capabilities of LLMs on this dataset. We find that the existing state-of-the-art approaches and pre-trained Language Models (LMs) are generally superior to LLMs in software vulnerability detection. However, in software vulnerability assessment and location, certain LLMs (e.g., CodeLlama and WizardCoder) have demonstrated superior performance compared to pre-trained LMs, and providing more contextual information can enhance the vulnerability assessment capabilities of LLMs. Moreover, LLMs exhibit strong vulnerability description capabilities, but their tendency to produce excessive output significantly weakens their performance compared to pre-trained LMs. Overall, though LLMs perform well in some aspects, they still need to improve both in understanding the subtle differences between code vulnerabilities and in describing vulnerabilities before they can fully realize their potential. Our evaluation pipeline provides valuable insights into the capabilities of LLMs in handling software vulnerabilities.
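The multi-task setup described above lends itself to a simple prompting harness. The Python sketch below shows how one might drive an LLM over Big-Vul-style records for the four tasks; the prompt wording, record fields, and the query_llm helper are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical multi-task evaluation loop over Big-Vul-style records.
# Record fields ('code', per-task labels) and query_llm() are assumptions.

TASK_PROMPTS = {
    "detection":   "Is the following function vulnerable? Answer yes or no.\n{code}",
    "assessment":  "Rate the severity (CVSS 0-10) of the vulnerability in:\n{code}",
    "location":    "Which lines contain the vulnerability?\n{code}",
    "description": "Describe the vulnerability in one sentence:\n{code}",
}

def evaluate(records, query_llm):
    """records: iterable of dicts with 'code' and per-task ground truth."""
    results = {task: [] for task in TASK_PROMPTS}
    for rec in records:
        for task, template in TASK_PROMPTS.items():
            answer = query_llm(template.format(code=rec["code"]))
            # Store (prediction, label) pairs for later metric computation.
            results[task].append((answer, rec.get(task)))
    return results
```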
Citations: 0
Qualitative Surveys in Software Engineering Research: Definition, Critical Review, and Guidelines
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-10-04 DOI: 10.1109/TSE.2024.3474173
Jorge Melegati;Kieran Conboy;Daniel Graziotin
Qualitative surveys are emerging as a popular research method in software engineering (SE), particularly as many aspects of the field are increasingly socio-technical and thus concerned with the subtle, social, and often ambiguous issues that are not amenable to a simple quantitative survey. While many argue that qualitative surveys play a vital role amongst the diverse range of methods employed in SE, there are a number of shortcomings that inhibit their use and value. First, there is a lack of clarity as to what defines a qualitative survey and what features differentiate it from other methods. There is an absence of a clear set of principles and guidelines for its execution, and what does exist is very inconsistent and sometimes contradictory. These issues undermine the perceived reliability and rigour of this method. Researchers are unsure how to ensure reliability and rigour when designing qualitative surveys, and reviewers are unsure how these should be evaluated. In this paper, we present a systematic mapping study to identify how qualitative surveys have been employed in SE research to date. This paper proposes a set of principles, based on a multidisciplinary review of qualitative surveys, capturing some of the commonalities of the diffuse approaches found. These principles can be used by researchers when choosing whether or not to conduct a qualitative survey, and then to design their study. The principles can also be used by editors and reviewers to judge the quality and rigour of qualitative surveys. It is hoped that this will result in more widespread use of the method and also more effective and evidence-based reviews of studies that use these methods in the future.
Citations: 0
FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-10-02 DOI: 10.1109/TSE.2024.3472476
Sakina Fatima;Hadi Hemmati;Lionel C. Briand
Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper, we focus on predicting the type of fix that is required to remove flakiness and then repairing the test code on that basis. We do this for a subset of flaky tests where the root cause of flakiness is in the test itself and not in the production code. One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, we augment the prompts of GPT 3.5 Turbo, a Large Language Model (LLM), with such extra knowledge to request repair suggestions. The results show that our suggested fix category labels, complemented with in-context learning, significantly enhance the capability of GPT 3.5 Turbo in generating fixes for flaky tests. Based on the execution and analysis of a sample of GPT-repaired flaky tests, we estimate that a large percentage of such repairs (roughly between 51% and 83%) can be expected to pass. For the failing repaired tests, on average, 16% of the test code needs to be further changed for them to pass.
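The core idea — feeding the predicted fix category into the repair prompt — can be sketched in a few lines. The category label, prompt wording, and the predict_fix_category/ask_llm helpers below are hypothetical stand-ins, not the paper's exact prompts or models.

```python
# Minimal sketch of category-guided flaky-test repair. The helpers
# predict_fix_category() and ask_llm() are assumed to be supplied.

def build_repair_prompt(test_code: str, fix_category: str) -> str:
    # Augment the plain repair request with the predicted fix category.
    return (
        "The following test is flaky. The flakiness is expected to be "
        f"fixable by: {fix_category}.\n"
        "Rewrite the test so it passes deterministically.\n\n"
        f"{test_code}"
    )

def repair_flaky_test(test_code, predict_fix_category, ask_llm):
    category = predict_fix_category(test_code)  # e.g. "add proper wait/synchronization"
    return ask_llm(build_repair_prompt(test_code, category))
```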
Citations: 0
LTM: Scalable and Black-Box Similarity-Based Test Suite Minimization Based on Language Models
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-09-30 DOI: 10.1109/TSE.2024.3469582
Rongqi Pan;Taher A. Ghaleb;Lionel C. Briand
Test suites tend to grow when software evolves, making it often infeasible to execute all test cases with the allocated testing budgets, especially for large software systems. Test suite minimization (TSM) is employed to improve the efficiency of software testing by removing redundant test cases, thus reducing testing time and resources while maintaining the fault detection capability of the test suite. Most existing TSM approaches rely on code coverage (white-box) or model-based features, which are not always available to test engineers. Recent TSM approaches that rely only on test code (black-box) have been proposed, such as ATM and FAST-R. The former yields higher fault detection rates (FDR) while the latter is faster. To address scalability while retaining a high FDR, we propose LTM (Language model-based Test suite Minimization), a novel, scalable, and black-box similarity-based TSM approach based on large language models (LLMs), which is the first application of LLMs in the context of TSM. To support similarity measurement using test method embeddings, we investigate five different pre-trained language models: CodeBERT, GraphCodeBERT, UniXcoder, StarEncoder, and CodeLlama, on which we compute two similarity measures: Cosine Similarity and Euclidean Distance. Our goal is to find similarity measures that are not only computationally more efficient but can also better guide a Genetic Algorithm (GA), which is used to search for optimal minimized test suites, thus reducing the overall search time. Experimental results show that the best configuration of LTM (UniXcoder/Cosine) outperforms ATM in three aspects: (a) achieving a slightly greater saving rate of testing time (41.72% versus 41.02%, on average); (b) attaining a significantly higher fault detection rate (0.84 versus 0.81, on average); and, most importantly, (c) minimizing test suites nearly five times faster on average, with higher gains for larger test suites and systems, thus achieving much higher scalability.
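As a rough illustration of the similarity measurement LTM builds on, the sketch below computes cosine similarity between test-method embeddings and a simple diversity score that a GA fitness function could maximize. The embedding step (CodeBERT, UniXcoder, etc.) is abstracted away behind pre-computed vectors; the scoring is illustrative, not the paper's exact fitness function.

```python
# Cosine similarity over test-method embeddings, plus a toy diversity
# score usable as a GA fitness: lower pairwise similarity = more diverse
# minimized suite. Embedding vectors are assumed pre-computed.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def suite_diversity(test_vectors):
    """1 - mean pairwise cosine similarity of the candidate suite."""
    n, total = len(test_vectors), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_similarity(test_vectors[i], test_vectors[j])
    pairs = n * (n - 1) / 2
    return 1.0 - (total / pairs if pairs else 0.0)
```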
Citations: 0
Fast and Precise Static Null Exception Analysis With Synergistic Preprocessing
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-09-23 DOI: 10.1109/TSE.2024.3466551
Yi Sun;Chengpeng Wang;Gang Fan;Qingkai Shi;Xiangyu Zhang
Pointer operations are common in programs written in modern programming languages such as C/C++ and Java. While widely used, pointer operations often suffer from bugs like null pointer exceptions that make software systems vulnerable and unstable. However, precisely verifying the absence of null pointer exceptions is notoriously slow, as we need to inspect a huge number of pointer-dereferencing operations one by one via expensive techniques like SMT solving. We observe that, among all pointer-dereferencing operations in a program, a large number can be proven safe by lightweight preprocessing. Thus, we can avoid employing costly techniques to verify their nullity. The impact of lightweight preprocessing techniques is significantly less studied and has been largely ignored by recent works. In this paper, we propose a new technique, BONA, which leverages the synergistic effects of two classic preprocessing analyses. The synergistic effects between the two preprocessing analyses allow us to recognize many more safe pointer operations before a follow-up costly nullity verification, thus improving the scalability of the whole null exception analysis. We have implemented our synergistic preprocessing procedure in two state-of-the-art static analyzers, KLEE and Pinpoint. The evaluation results demonstrate that BONA itself is fast and can finish in a few seconds for programs that KLEE and Pinpoint may require several minutes or even hours to analyze. Compared to the vanilla versions of KLEE and Pinpoint, BONA respectively enables them to achieve up to 1.6x and 6.6x speedup (1.2x and 3.8x on average) with less than 0.5% overhead. Such a speedup is significant enough as it allows KLEE and Pinpoint to check more pointer-dereferencing operations in a given time budget and, thus, discover over a dozen previously unknown null pointer exceptions in open-source projects.
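To make the "cheap preprocessing before expensive verification" idea concrete, here is a deliberately simplified sketch: a single pass over a toy statement list marks a dereference safe when the same pointer was null-checked earlier, so only the remaining dereferences would be handed to an SMT-based verifier. The statement encoding is invented for illustration and is far simpler than BONA's synergistic analyses.

```python
# Toy pre-filter: separate dereferences provably guarded by an earlier
# null check from those that still need costly (SMT-based) verification.
# Statements are ('check', ptr) or ('deref', ptr) tuples -- an invented format.

def split_dereferences(block):
    checked, safe, needs_smt = set(), [], []
    for idx, (kind, ptr) in enumerate(block):
        if kind == "check":
            checked.add(ptr)          # ptr is known non-null from here on
        elif kind == "deref":
            (safe if ptr in checked else needs_smt).append((idx, ptr))
    return safe, needs_smt

# Example: only the unchecked dereference of 'q' reaches the SMT stage.
safe, pending = split_dereferences(
    [("check", "p"), ("deref", "p"), ("deref", "q")]
)
```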
Citations: 0
Towards a Cognitive Model of Dynamic Debugging: Does Identifier Construction Matter?
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-09-20 DOI: 10.1109/TSE.2024.3465222
Danniell Hu;Priscila Santiesteban;Madeline Endres;Westley Weimer
Debugging is a vital and time-consuming process in software engineering. Recently, researchers have begun using neuroimaging to understand the cognitive bases of programming tasks by measuring patterns of neural activity. While exciting, prior studies have only examined small sub-steps in isolation, such as comprehending a method without writing any code or writing a method from scratch without reading any already-existing code. We propose a simple multi-stage debugging model in which programmers transition between Task Comprehension, Fault Localization, Code Editing, Compiling, and Output Comprehension activities. We conduct a human study of n = 28 participants using a combination of functional near-infrared spectroscopy and standard coding measurements (e.g., time taken, tests passed, etc.). Critically, we find that our proposed debugging stages are both neurally and behaviorally distinct. To the best of our knowledge, this is the first neurally-justified cognitive model of debugging. At the same time, there is significant interest in understanding how programmers from different backgrounds, such as those grappling with challenges in English prose comprehension, are impacted by code features when debugging. We use our cognitive model of debugging to investigate the role of one such feature: identifier construction. Specifically, we investigate how features of identifier construction impact neural activity while debugging by participants with and without reading difficulties. While we find significant differences in cognitive load as a function of morphology and expertise, we do not find significant differences in end-to-end programming outcomes (e.g., time, correctness, etc.). This nuanced result suggests that prior findings on the cognitive importance of identifier naming in isolated sub-steps may not generalize to end-to-end debugging. Finally, in a result relevant to broadening participation in computing, we find no behavioral outcome differences for participants with reading difficulties.
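The five-stage model can be represented directly in code. The sketch below encodes the stages and tallies time per stage from a labeled activity log; the log format is an assumption for illustration, not the authors' analysis pipeline.

```python
# Representing the paper's five debugging stages and aggregating time
# per stage from one session's labeled activity log (assumed format).

from enum import Enum

class Stage(Enum):
    TASK_COMPREHENSION = "task comprehension"
    FAULT_LOCALIZATION = "fault localization"
    CODE_EDITING = "code editing"
    COMPILING = "compiling"
    OUTPUT_COMPREHENSION = "output comprehension"

def time_per_stage(log):
    """log: list of (Stage, seconds) events from one debugging session."""
    totals = {stage: 0.0 for stage in Stage}
    for stage, seconds in log:
        totals[stage] += seconds
    return totals
```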
Citations: 0
SCAnoGenerator: Automatic Anomaly Injection for Ethereum Smart Contracts
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-09-20 DOI: 10.1109/TSE.2024.3464539
Pengcheng Zhang;Ben Wang;Xiapu Luo;Hai Dong
Although many tools have been developed to detect anomalies in smart contracts, the evaluation of these analysis tools has been hindered by the lack of adequate anomalistic real-world contracts (i.e., smart contracts with addresses on Ethereum to achieve certain purposes). This problem prevents conducting reliable performance assessments on the analysis tools. An effective way to solve this problem is to inject anomalies into real-world contracts and automatically label the locations and types of the injected anomalies. SolidiFI, as the first and only tool in this area, was developed to automatically inject anomalies into Ethereum smart contracts. However, SolidiFI is subject to the limitations from its methodologies (e.g., its injection accuracy and authenticity are low). To address these limitations, we propose an approach called SCAnoGenerator. SCAnoGenerator supports Solidity 0.5.x, 0.6.x, 0.7.x and enables automatic anomaly injection for Ethereum smart contracts via analyzing the contracts’ control and data flows. Based on this approach, we develop an open-source tool, which can inject 20 types of anomalies into smart contracts. The extensive experiments show that SCAnoGenerator outperforms SolidiFI on the number of injected anomaly types, injection accuracy, and injection authenticity. The experimental results also reveal that existing analysis tools can only partially detect the anomalies injected by SCAnoGenerator.
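As a toy illustration of injection with automatic labeling (much simpler than SCAnoGenerator's control- and data-flow analysis), the sketch below inserts a timestamp-dependence snippet at the start of the first Solidity function body it finds and records the anomaly's type and line. The snippet and matching heuristic are invented for the sketch.

```python
# Toy anomaly injector: add a timestamp-dependence statement to the
# first function body found and return the labeled location.

import re

TIMESTAMP_ANOMALY = "        require(block.timestamp % 2 == 0);  // injected\n"

def inject(source: str, anomaly_type: str = "timestamp-dependence"):
    lines = source.splitlines(keepends=True)
    for i, line in enumerate(lines):
        # Crude match for a function header ending with an opening brace.
        if re.search(r"function\s+\w+\s*\(.*\)\s*.*\{\s*$", line):
            lines.insert(i + 1, TIMESTAMP_ANOMALY)
            label = {"type": anomaly_type, "line": i + 2}  # 1-based line number
            return "".join(lines), label
    return source, None  # no injection point found
```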
Citations: 0
Metamorphic Testing of Image Captioning Systems via Image-Level Reduction
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-09-19 DOI: 10.1109/TSE.2024.3463747
Xiaoyuan Xie;Xingpeng Li;Songqiang Chen
The Image Captioning (IC) technique is widely used to describe images in natural language. However, even state-of-the-art IC systems can still produce incorrect captions and lead to misunderstandings. Recently, some IC system testing methods have been proposed. However, these methods still rely on pre-annotated information and hence cannot really alleviate the difficulty of identifying the test oracle. Furthermore, they artificially manipulate objects, which may generate unreal images as test cases and thus lead to less meaningful testing results. Finally, existing methods impose various requirements on the eligibility of source test cases and hence cannot fully utilize the given images to perform testing. To tackle these issues, in this paper, we propose ReIC to perform metamorphic testing for IC systems with image-level reduction transformations such as image cropping and stretching. Instead of relying on pre-annotated information, ReIC uses a localization method to align objects in the caption with corresponding objects in the image, and checks whether each object is correctly described or deleted in the caption after transformation. With the image-level reduction transformations, ReIC does not artificially manipulate any objects and hence can avoid generating unreal follow-up images. Additionally, it eliminates the requirement on the eligibility of source test cases during the metamorphic transformation process, as well as decreases the ambiguity and boosts the diversity among the follow-up test cases, which consequently enables testing to be performed on any test image and reveals more distinct valid violations. We employ ReIC to test five popular IC systems. The results demonstrate that ReIC can sufficiently leverage the provided test images to generate follow-up cases of good realism, and effectively detect a great number of distinct violations, without the need for any pre-annotated information.
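The metamorphic check for a reduction transformation such as cropping can be phrased as a set comparison: objects localized inside the kept region should still be described in the follow-up caption, while cropped-away objects should not. The sketch below assumes an upstream localizer that yields object bounding boxes and an inside() predicate; both are stand-ins for ReIC's actual components.

```python
# Metamorphic check for a crop transformation: compare the set of
# objects surviving the crop against the objects named in the
# follow-up caption. image_objs and inside() are assumed inputs.

def check_crop(image_objs, kept_region, caption_objs_after, inside):
    """image_objs: {name: bbox}; inside(bbox, region) -> bool."""
    violations = []
    for name, bbox in image_objs.items():
        kept = inside(bbox, kept_region)
        if kept and name not in caption_objs_after:
            violations.append(f"'{name}' kept in image but missing from caption")
        if not kept and name in caption_objs_after:
            violations.append(f"'{name}' cropped away but still described")
    return violations
```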
Citations: 0
Measuring the Fidelity of a Physical and a Digital Twin Using Trace Alignments
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-09-18 DOI: 10.1109/TSE.2024.3462978
Paula Muñoz;Manuel Wimmer;Javier Troya;Antonio Vallecillo
Digital twins are gaining relevance in many domains to improve the operation and maintenance of complex systems. Despite their importance, most efforts are currently focused on their design, development, and deployment but do not fully address their validation. In this paper, we are interested in assessing the fidelity of physical and digital twins and, more specifically, whether they exhibit twinned behaviors. This will allow engineers to check the suitability of the digital twin for its intended purpose. Our approach assesses their fidelity by comparing the behavioral traces of the two twins. Our contribution is threefold. First, we define a measure of equivalence between individual snapshots capable of deciding whether two snapshots are sufficiently similar. Second, we use a trace alignment algorithm to align the corresponding equivalent states reached by the two twins. Finally, we measure the fidelity of the behavior of the two twins using the level of alignment achieved in terms of the percentage of matched snapshots and the distance between the aligned traces. Our proposal has been validated with the digital twins of four cyber-physical systems: an elevator, an incubator, a robotic arm, and a programmable robotic car. We were able to determine which systems were sufficiently faithful and which parts of their behavior failed to emulate their counterparts. Finally, we compared our proposal with similar approaches from the literature, highlighting their respective strengths and weaknesses related to our own.
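The two ingredients described above — snapshot equivalence and trace alignment — can be sketched with a tolerance-based comparison and Needleman-Wunsch-style scoring, as below. The tolerance, scoring constants, and snapshot representation are illustrative assumptions, not the paper's exact formulation.

```python
# Tolerance-based snapshot equivalence plus a Needleman-Wunsch-style
# global alignment score between the physical and digital traces.
# Snapshots are dicts of numeric attributes with identical keys.

def equivalent(s1, s2, tol=0.05):
    return all(abs(s1[k] - s2[k]) <= tol * max(abs(s1[k]), abs(s2[k]), 1e-9)
               for k in s1)

def align(trace_a, trace_b, match=1, gap=-1, mismatch=-1):
    n, m = len(trace_a), len(trace_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # gap penalties along the borders
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if equivalent(trace_a[i-1], trace_b[j-1]) else mismatch
            score[i][j] = max(score[i-1][j-1] + s,   # align the two snapshots
                              score[i-1][j] + gap,   # gap in trace_b
                              score[i][j-1] + gap)   # gap in trace_a
    return score[n][m]   # higher = more faithfully twinned behavior
```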
Citations: 0
Mitigating Noise in Quantum Software Testing Using Machine Learning
IF 6.5 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date: 2024-09-18 DOI: 10.1109/TSE.2024.3462974
Asmar Muqeet;Tao Yue;Shaukat Ali;Paolo Arcaini
Quantum Computing (QC) promises computational speedup over classic computing. However, noise exists in near-term quantum computers. Quantum software testing (for gaining confidence in quantum software's correctness) is inevitably impacted by noise, i.e., it is impossible to know if a test case failed due to noise or real faults. Existing testing techniques test quantum programs without considering noise, i.e., by executing tests on ideal quantum computer simulators. Consequently, they are not directly applicable to test quantum software on real quantum computers or noisy simulators. Thus, we propose a noise-aware approach (named QOIN) to alleviate the noise effect on test results of quantum programs. QOIN employs machine learning techniques (e.g., transfer learning) to learn the noise effect of a quantum computer and filter it from a program's outputs. Such filtered outputs are then used as the input to perform test case assessments (determining the passing or failing of a test case execution against a test oracle). We evaluated QOIN on IBM's 23 noise models, Google's two available noise models, and Rigetti's Quantum Virtual Machine, with six real-world and 800 artificial programs. We also generated faulty versions of these programs to check if a failing test case execution can be determined under noise. Results show that QOIN can reduce the noise effect by more than 80% on most noise models. We used an existing test oracle to evaluate QOIN's effectiveness in quantum software testing. The results showed that QOIN attained scores of 99%, 75%, and 86% for precision, recall, and F1-score, respectively, for the test oracle across six real-world programs. For artificial programs, QOIN achieved scores of 93%, 79%, and 86% for precision, recall, and F1-score, respectively. This highlights QOIN's effectiveness in learning noise patterns for noise-aware quantum software testing.
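One simple way to picture "learning the noise and filtering it out" is to model the noise as a linear map between ideal and noisy output distributions, fit it on calibration programs, and invert it at test time. The sketch below does exactly that with a least-squares fit; it is a simplification of QOIN's transfer-learning-based filtering, and the function names are illustrative.

```python
# Linear noise-filtering sketch: estimate M with noisy ~= M @ ideal from
# calibration runs, then invert M to recover a filtered distribution.

import numpy as np

def fit_noise_matrix(ideal_dists, noisy_dists):
    """Least-squares fit of M such that noisy ~= M @ ideal."""
    I = np.stack(ideal_dists, axis=1)   # shape (k outcomes, n programs)
    N = np.stack(noisy_dists, axis=1)
    return N @ np.linalg.pinv(I)

def filter_output(noisy_dist, M):
    est = np.linalg.pinv(M) @ noisy_dist
    est = np.clip(est, 0, None)         # clamp small negative artifacts
    return est / est.sum()              # renormalize to a distribution
```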
Citations: 0